LLM Cookbook for Psychometricians

Examples with R and Python

Categories: R, Python, LLM, ChatGPT

Author: Jihong Zhang

Published: March 7, 2025

Modified: April 21, 2025

Notice

This is a draft document. The content is not final and may change in the future. Please check back later for updates.

Figure 1: The Likert-scale motivation item generated by GPT-4.1 in the OpenAI Prompts Playground

1 GPT-4.1

OpenAI released GPT-4.1 (model: gpt-4.1-2025-04-14) in April 2025. The price is $2 per 1M input tokens. As a reference, in my recent project using LLMs to generate survey responses, each prompt ranges from 5K to 10K tokens, which means 100 calls cost around $2, or about $0.02 per call. The question is how GPT-4.1 performs compared to GPT-4o for survey response generation. I will use two mini studies to examine the differences between GPT-4o and GPT-4.1 when responding to the same prompt.
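As a quick sanity check on these numbers, here is a back-of-the-envelope sketch (assuming 10K tokens per call, the high end of the range above):

# Rough cost estimate: $2 per 1M input tokens, ~10K tokens per call (assumption)
price_per_million = 2.00
tokens_per_call = 10_000
n_calls = 100

total_cost = n_calls * tokens_per_call / 1_000_000 * price_per_million
print(f"{n_calls} calls ~ ${total_cost:.2f} (${total_cost / n_calls:.2f} per call)")
# 100 calls ~ $2.00 ($0.02 per call)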

1.1 Mini Study 1: Likert-scale item response generation

In Study 1, the prompt is a general prompt from my recent project on survey data generation for after-school program staff using large language models. Here is the Role & Task section that I used in the prompt:

You are a staff member of club serving underserved communities. The program enrolls at least 50% of youth (ages 9–14) from low-income households (as defined by free/reduced lunch status) and minority backgrounds.

Your task is to construct a persona. Based on this persona, generate numerical responses to survey items in Survey Questions section using a Likert scale in Response Scale section that aligns with your expressed views.

The simulated survey is a validated measure of exercise motivation. It consists of 15 Likert-scale items with six response options (1 = Strongly disagree, 2 = Disagree, 3 = Somewhat disagree, 4 = Somewhat agree, 5 = Agree, 6 = Strongly agree).

Result

Step 1: Create meta prompts without personal information
instruction <- "
Instructions:

For your output,
  - Only provide the numerical responses.
  - Format your responses as a semicolon-separated list: X;X;X;X;X;X;X, where X represents your selected option for each question, the total number of X depends on number of items.
  - Example: '1;3;2' corresponds to:
    - 1 (Strongly Agree) for the first question
    - 3 (Somewhat Agree) for the second question
    - 2 (Agree) for the third question
    
The scale for all items is: 1 = Strongly disagree, 2 = Disagree, 3 = Somewhat disagree, 4 = Somewhat agree, 5 = Agree, 6 = Strongly agree.      
"
prompt <- "
Role & Task:

You are a staff member of club serving underserved communities. The program enrolls at least 50% of youth (ages 9–14) from low-income households (as defined by free/reduced lunch status) and minority backgrounds.

Your task is to construct a persona. Based on this persona, generate numerical responses to survey items in Survey Questions section using a Likert scale in Response Scale section that aligns with your expressed views.

Survey Questions:
Item1: I exercise because other people say I should.
Item2: I take part in exercise because my friends/family/spouse say I should.
Item3: I exercise because others will not be pleased with me if I don't.
Item4: I feel under pressure from my friends/family to exercise.
Item5: I feel guilty when I don't exercise.
Item6: I feel ashamed when I miss an exercise session.
Item7: I feel like a failure when I haven't exercised in a while.
Item8: I value the benefits of exercise.
Item9: It's important to me to exercise regularly.
Item10: I think it is important to make the effort to exercise regularly.
Item11: I get restless if I don't exercise regularly.
Item12: I exercise because it's fun.
Item13: I enjoy my exercise sessions.
Item14: I find exercise a pleasurable activity.
Item15: I get pleasure and satisfaction from participating in exercise
"
Step 2: Python functions to call GPT API
import time  # handy for pacing repeated API calls
import os
import csv
from openai import OpenAI

# ChatGPT: query GPT-4o via the Responses API and return its text output
def get_response_gpt4o(instructions, input, temp):
    # Initialize the OpenAI API client
    client = OpenAI()
    response = client.responses.create(
        model="gpt-4o",
        instructions=instructions,
        input=input,
        temperature=temp,
    )
    return response.output_text

# Same call, but pinned to the GPT-4.1 snapshot
def get_response_gpt41(instructions, input, temp):
    # Initialize the OpenAI API client
    client = OpenAI()
    response = client.responses.create(
        model="gpt-4.1-2025-04-14",
        instructions=instructions,
        input=input,
        temperature=temp,
    )
    return response.output_text
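The time module imported above becomes handy when you repeat these calls in a loop. A minimal, hypothetical sketch (assuming a Quarto/reticulate session where the Step 1 R objects are visible to Python as r.instruction and r.prompt):

# Hypothetical batch loop; only get_response_gpt41() comes from the functions above.
# r.instruction and r.prompt assume reticulate exposes the R objects from Step 1.
responses = []
for _ in range(5):
    responses.append(get_response_gpt41(instructions=r.instruction, input=r.prompt, temp=0))
    time.sleep(1)  # brief pause between calls to stay under rate limits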
Step 3: Data visualization for LLM-generated responses
library(reticulate)  # access the Python helpers above via py$
library(tidyverse)   # str_split(), pivot_longer(), ggplot()

gpt4o_resp <- py$get_response_gpt4o(instructions = instruction, 
                                    input = prompt, temp = 0)
gpt41_resp <- py$get_response_gpt41(instructions = instruction, 
                                    input = prompt, temp = 0)

res <- data.frame(
  gpt4o = unlist(str_split(gpt4o_resp, ";")),
  gpt41 = unlist(str_split(gpt41_resp, ";")),
  Item = factor(1:15)
) |> 
  pivot_longer(starts_with("gpt"), names_to = "Model", values_to = "Response")

ggplot(res) +
  geom_path(aes(x = Item, y = Response, color = Model, group = Model) ) +
  scale_color_manual(values = c("royalblue", "red3"), 
                     labels = c("GPT4.1", "GPT4o")) +
  theme_classic()
Figure 2: Generated survey output for GPT-4o and GPT-4.1

As shown in Figure 2, without providing any context information, two items show differences between the responses generated by GPT-4o and GPT-4.1. Looking at these two items in detail, the difference may come from strong emotional words such as “ashamed” or “failure”.

  • Item6: I feel ashamed when I miss an exercise session.

  • Item7: I feel like a failure when I haven’t exercised in a while.

Whereas GPT-4o chose “Somewhat disagree”, GPT-4.1 tends to choose “Disagree” on these two items with negative emotional wording. Does GPT-4.1 tend to be more emotionally human-like, mentally avoiding negative emotions?

1.2 Mini Study 2: Probability-based generation

Instead of asking the LLM to generate survey responses directly, we can also instruct it to generate a probability distribution of responses for each item. For example, instead of letting the LLM output “1;2;2” for three 6-point Likert-scale items, we ask it to output Xi_k1:pi1,Xi_k2:pi2,Xi_k3:pi3,Xi_k4:pi4,Xi_k5:pi5,Xi_k6:pi6;.

where:

  • Xi_kj denotes response option j (from 1 to 6) for item i
  • pij is the probability (between 0 and 1) of selecting that option.
  • Each line (separated by semicolons) corresponds to a different item
  • The probabilities for each item must sum to 1
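One way to use such a distribution downstream is to draw a concrete response per item from it. A minimal sketch (the probability vector is just the illustrative example used later in the instruction, not actual model output):

# Turn one item's probability distribution into a sampled Likert response (1-6).
import random

probs = [0.50, 0.10, 0.10, 0.10, 0.10, 0.10]  # illustrative P(option 1) ... P(option 6)
sampled = random.choices(range(1, 7), weights=probs, k=1)[0]
print(sampled)  # e.g., 1 about half the time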

Result

Use response probabilities for generated responses
instruction2 <- "
Instructions:
Generate probabilistic responses for each survey item using the following format. Each response should represent a probability distribution across six ordinal options of 15 items. The pribabilities of all options of each item should be two or three decimal places. Ensure that for each item, the sum of all six probabilities equals 1.

- Output format:

Provide only numerical responses. Use a semicolon-separated list to indicate the probability of selecting each response option for each item.

- Structure:

For Item i, the response should follow this pattern:

Xi_k1:pi1,Xi_k2:pi2,Xi_k3:pi3,Xi_k4:pi4,Xi_k5:pi5,Xi_k6:pi6;

where:
  - Xi_kj denotes response option j (from 1 to 6) for item i
  - pij is the probability (between 0 and 1) of selecting that option. 
  - Each line (separated by semicolons) corresponds to a different item
  - The probabilities for each item must sum to 1

- Example:
  
  X1_k1:0.50,X1_k2:0.10,X1_k3:0.10,X1_k4:0.10,X1_k5:0.10,X1_k6:0.10;

This indicates that for Item 1, the probabilities of selecting options 1 through 6 are 50%, 10%, 10%, 10%, 10%, and 10%, respectively.

- Scale (applies to all items):
  1 = Strongly Disagree
  2 = Disagree
  3 = Somewhat Disagree
  4 = Somewhat Agree
  5 = Agree
  6 = Strongly Agree
"

gpt4o_resp2 <- py$get_response_gpt4o(instructions=instruction2, 
                                    input=prompt, temp = 0)
gpt41_resp2 <- py$get_response_gpt41(instructions=instruction2, 
                                    input=prompt, temp = 0)

You can see that the GPT models understand the prompt very well:

cat(gpt4o_resp2)
X1_k1:0.300,X1_k2:0.250,X1_k3:0.200,X1_k4:0.150,X1_k5:0.050,X1_k6:0.050;
X2_k1:0.250,X2_k2:0.200,X2_k3:0.200,X2_k4:0.150,X2_k5:0.100,X2_k6:0.100;
X3_k1:0.350,X3_k2:0.250,X3_k3:0.150,X3_k4:0.100,X3_k5:0.075,X3_k6:0.075;
X4_k1:0.300,X4_k2:0.250,X4_k3:0.200,X4_k4:0.100,X4_k5:0.075,X4_k6:0.075;
X5_k1:0.200,X5_k2:0.200,X5_k3:0.200,X5_k4:0.150,X5_k5:0.125,X5_k6:0.125;
X6_k1:0.250,X6_k2:0.200,X6_k3:0.200,X6_k4:0.150,X6_k5:0.100,X6_k6:0.100;
X7_k1:0.300,X7_k2:0.250,X7_k3:0.150,X7_k4:0.100,X7_k5:0.100,X7_k6:0.100;
X8_k1:0.050,X8_k2:0.050,X8_k3:0.100,X8_k4:0.200,X8_k5:0.300,X8_k6:0.300;
X9_k1:0.050,X9_k2:0.050,X9_k3:0.100,X9_k4:0.200,X9_k5:0.300,X9_k6:0.300;
X10_k1:0.050,X10_k2:0.050,X10_k3:0.100,X10_k4:0.200,X10_k5:0.300,X10_k6:0.300;
X11_k1:0.100,X11_k2:0.100,X11_k3:0.150,X11_k4:0.200,X11_k5:0.225,X11_k6:0.225;
X12_k1:0.100,X12_k2:0.100,X12_k3:0.150,X12_k4:0.200,X12_k5:0.225,X12_k6:0.225;
X13_k1:0.050,X13_k2:0.050,X13_k3:0.100,X13_k4:0.200,X13_k5:0.300,X13_k6:0.300;
X14_k1:0.050,X14_k2:0.050,X14_k3:0.100,X14_k4:0.200,X14_k5:0.300,X14_k6:0.300;
X15_k1:0.050,X15_k2:0.050,X15_k3:0.100,X15_k4:0.200,X15_k5:0.300,X15_k6:0.300;

Let’s do some data cleaning and then visualize the item response probabilities.

Visualize the response probabilities for the two LLM models
# Parse the raw GPT output string into a long data frame of (Item, Option, probability)
get_df_from_output <- function(raw_gpt_op) {
  all_item_resp_name <- NULL
  all_item_resp_prob <- NULL
  item_resps <- str_split(raw_gpt_op, "\n")[[1]]
  for (i in 1:length(item_resps)) { # i = 1
    each_resp_foritem <- str_split(str_remove(item_resps[i], ";"), ",")[[1]]
    for (r in 1:length(each_resp_foritem)) { # for each response, r = 1
      each_resp <- str_split(each_resp_foritem[r], ":")[[1]]
      each_resp_name <- each_resp[1]
      each_resp_prob <- as.numeric(each_resp[2])
      all_item_resp_name <- c(all_item_resp_name, each_resp_name)
      all_item_resp_prob <- c(all_item_resp_prob, each_resp_prob)
    }
  }
  res <- data.frame(
    item_resps = all_item_resp_name,
    probability = all_item_resp_prob
  ) |> 
    separate(item_resps, into = c("Item", "Option"), sep = "_") |> 
    mutate(Option = factor(str_remove(Option, "k"), levels = 1:6),
           Item = factor(Item, levels = paste0("X", 15:1)))
  return(res)
}

gpt4o_resp2_dt <- get_df_from_output(gpt4o_resp2)
gpt41_resp2_dt <- get_df_from_output(gpt41_resp2)

dt_comb <- rbind(
  gpt4o_resp2_dt |> mutate(Model = "GPT-4o"),
  gpt41_resp2_dt |> mutate(Model = "GPT-4.1")
) 

mycolors = c("#4682B4", "#B4464B", "#B4AF46", 
             "#1B9E77", "#D95F02", "#7570B3",
             "#E7298A", "#66A61E", "#B4F60A")

ggplot(dt_comb) +
  geom_col(aes(x=Item, y = probability, fill = Option), 
           position =  position_stack(reverse = TRUE)) +
  geom_text(aes(x=Item, y = probability, 
                label = probability, group = Item),
            size = 2,
            position = position_stack(reverse = TRUE, vjust = 0.5)) +
  facet_wrap(~ Model) +
  coord_flip() +
  scale_fill_manual(values = mycolors)
Figure 3: Item response probability distributions for GPT-4o and GPT-4.1

As Figure 3 shows, GPT-4.1’s item probabilities differ from GPT-4o’s for Item 6 and Item 7. Taking Item 7 as an example, options 2 (Disagree) and 3 (Somewhat disagree) have the same probability (p = .22) under GPT-4.1, while GPT-4o places more weight on 1 (Strongly disagree, p = .3) than on 2 (Disagree, p = .25) and 3 (Somewhat disagree, p = .2).

In addition, GPT-4o appears to produce exactly the same response distribution for Items 8–10 and 13–15 (and likewise for Items 11–12). In contrast, GPT-4.1 produces different response distributions across these items.

Take-home notes
  • GPT-4.1 is more sensitive to emotional words than GPT-4o.
  • GPT-4.1 is more likely than GPT-4o to produce different response probability distributions across items, suggesting that GPT-4.1 generates more diverse item responses.

2 Responses API Walk-through

As shown in Figure 1, OpenAI provides the Prompts Playground to test your prompts. OpenAI has recently updated its API from Chat Completions to the Responses API.

To use the Responses API, update the openai Python module to the latest version using the following command:

pip install openai --upgrade

Responses vs. Chat Completions

OpenAI provides two types of API interfaces (see Responses vs. Chat Completions). Previously, I used client.chat.completions; in the following example, I will use client.responses.

import os
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    instructions="You are a coding assistant that talks like a pirate.",
    input="How do I check if a Python object is an instance of a class?",
)

print(response.output_text)
Arrr, matey! To see if yer Python object be an instance of a class, ye use the trusty `isinstance()` function. Here’s how ye do it:

```python
if isinstance(your_object, YourClass):
    print("Aye, it be the class instance ye be lookin' for!")
else:
    print("Nay, it ain't the right class.")
```

Just replace `your_object` with the object ye be checkin', and `YourClass` with the class itself. Smooth sailin', savvy?

2.1 Image example

Responses API with an image as input
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "user", "content": "Describe the person  in the image using 5-6 sentences?"},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": "https://campusdata.uark.edu/resources/images/FacultyStaffProfile/jzhang.jpg?1743825231865&w=440",
                }
            ],
        },
    ],
)


print(response.output_text)
The person in the image appears to be a professional, dressed in a gray pinstripe suit with a light-colored shirt. They have short, dark hair and are wearing glasses with a thin frame. Their expression is calm and composed, suggesting confidence and approachability. The background is plain and neutral, focusing attention on the individual. Overall, the person's attire and demeanor convey a sense of professionalism and seriousness.

Use AI to reproduce an academic figure

Let’s try a more academic figure — a histogram plot from the web:

LLM-generated R code
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "user", "content": "You are a R programming assistant. Provide R code to generate this figure. Make sure only R code is provide."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": "https://seaborn.pydata.org/_images/distributions_3_0.png",
                }
            ],
        },
    ],
)


print(response.output_text)
```r
# Load necessary libraries
library(ggplot2)
library(palmerpenguins)

# Create the histogram
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(x = "flipper_length_mm", y = "Count") +
  theme_minimal(base_size = 15) +
  theme(panel.grid.major = element_line(color = "white"))
```
Original Plot

AI-generated Plot
R Code
# Required packages
library(ggplot2)
library(palmerpenguins)

# Plot
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.6) +
  theme_minimal(base_size = 15) +
  labs(x = "flipper_length_mm", y = "Count")

2.2 Conversation State

Resource: https://platform.openai.com/docs/guides/conversation-state?api-mode=responses

The idea is to maintain a growing list of messages: appending each new user and assistant message to this list carries the conversation state, including the history, into the next call.

In the OpenAI Responses API, the previous_response_id parameter can also be used to chain responses and create a threaded conversation.
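A hedged sketch of that chained style, following the conversation-state guide linked above (responses are stored by default, so a later call only needs the earlier response’s id; the prompts here are illustrative):

# Hypothetical sketch of chaining turns with previous_response_id;
# the example that follows manages the message history manually instead.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o-mini",
    input="Generate one 6-point Likert-scale item measuring exercise motivation.",
)
followup = client.responses.create(
    model="gpt-4o-mini",
    previous_response_id=first.id,  # link this turn to the previous response
    input="Now generate a plausible response for a person with a high factor score.",
)
print(followup.output_text)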

from openai import OpenAI

client = OpenAI()
def get_conversation(history):
    # Send the full message history; store=False keeps the state on the client side
    response = client.responses.create(
        model="gpt-4o-mini", input=history, store=False
    )

    # Append the model's reply so the next call sees the whole conversation
    history += [
        {"role": output.role, "content": output.content} for output in response.output
    ]

    return history

history = [{"role": "user", "content": "Can your generate one 6-Likert scale item response with item difficult as 2 for a persoon with 1 standard deviation factor score?"}]
res = get_conversation(history)
print(res[1]['content'][0].text)

Certainly! A 6-point Likert scale item could measure something like agreement, satisfaction, or frequency. Given that the item difficulty is set at 2 (on a scale where 1 is the easiest and 6 is the hardest), we can assume that the item is somewhat challenging, but not overly so.

Here’s an example of an item:

Item: “I find it easy to stay organized in my daily tasks.”

Response Options: 1. Strongly Disagree 2. Disagree 3. Neutral 4. Agree 5. Strongly Agree 6. Not Applicable

For a person with a standard deviation factor score of 1, we would expect their response to reflect that they find the item somewhat challenging but manageable. They might respond as follows, assuming a moderate agreement:

Response: “Agree” (4)

This response indicates that they find staying organized generally attainable, aligning their confidence level with the difficulty indicated by the item.

# For a second turn, append a new user message and call the function again:
res.append({"role": "user", "content": "Can your generate another 6-Likert scale item response with item difficult as 2 and item discrimination as 1 for a persoon with 2 standard deviation factor score?"})
res2 = get_conversation(res)
print(res2[3]['content'][0].text)

Sure! Given the item difficulty of 2 and a discrimination of 1, we can create a Likert scale item that is moderately easy but effectively distinguishes between individuals’ abilities or attitudes.

Example Item:

Item: “I often feel motivated to complete my goals.”

Response Options: 1. Strongly Disagree 2. Disagree 3. Neutral 4. Agree 5. Strongly Agree 6. Not Applicable

For a person with a factor score of 2 standard deviations above the mean, we would expect a strong agreement with the item due to their high motivation level. Therefore, they might respond as follows:

Response: “Strongly Agree” (5)

This response reflects their high motivation, effectively distinguishing them from others on this metric.

3 Example 1: Constructing LLM-calling functions with Python and R

The following section is inspired by the fascinating guide written by Pavel Bazin.

First of all, if you already have Python installed, run one of the following bash commands in the terminal to install (or upgrade) the openai Python module.

pip install openai
python3 -m pip install openai
pip install openai --upgrade

The next step is to either create a new Python file (.py) or use a {python} code chunk in Quarto, import the OpenAI module, and construct a Python function called llm_eval(). This function calls an OpenAI ChatGPT model (e.g., gpt-4o) via the API key:

# Further down these imports will be omitted for brevity
import os
from openai import OpenAI


def llm_eval(prompt: str, message: str, model: str = "gpt-4o"):
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    return client.chat.completions.create(
        model=model,
        messages=messages
    )

Then, we should be able to call the function either in R (using the reticulate package) or in Python:

library(reticulate)

prompt = "
You are a data parsing assistant. 
User provides a list of groceries. 
Your goal is to output it as JSON.
"
message = "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk."

res = py$llm_eval(prompt=prompt, message=message)

json_data = res$choices[[1]]$message$content

cat(json_data)
```json
{
  "groceries": [
    "bread",
    "pack of eggs",
    "apples",
    "bottle of milk"
  ]
}
```

There are two things that should be noted:

  1. If you are using Quarto, use py$<python object> in an R code chunk to access the Python object. If you keep the Python code and R code in separate files, use source_python("<Path to Python file>") to load the Python function into the R session.

  2. In contrast to res.choices[0].message.content in Python, note that in R the nested list is extracted using $, with indexing starting at 1 (e.g., res$choices[[1]]$message$content).

prompt = """
You are a data parsing assistant. 
User provides a list of groceries. 
Your goal is to output it as JSON.
"""
message = "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk."

res = llm_eval(prompt=prompt, message=message)
json_data = res.choices[0].message.content

print(json_data)

3.1 response_format: Structured Response

By default, the LLM returned a markdown-formatted string containing JSON. The reason is that we didn’t enable structured output in the API call.

To return plain JSON text, we can enable structured output in the Python function by setting response_format:

def llm_eval2(prompt: str, message: str, model: str = "gpt-4o"):
    
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]
    
    return client.chat.completions.create(
        model=model,
        messages=messages,
        # Enable structured output capability
        response_format={"type": "json_object"},
    )
# An R function to request the LLM's response given a prompt and a message.
get_response <- function(prompt, message) {
  res <- py$llm_eval2(prompt=as.character(prompt), 
                      message=as.character(message))
  output <- res$choices[[1]]$message$content
  return(output)
}

Now, running the same code returns plain JSON. That is great not only because we don’t need to parse anything extra, but also because it guarantees that the LLM won’t include any free-form text such as “Sure, here is your data! {…}”.

json_data2 = get_response(prompt=prompt, message=message)

cat(json_data2)
{
  "groceries": [
    "bread",
    "pack of eggs",
    "apples",
    "bottle of milk"
  ]
}

We can further transform the JSON text into multiple formats:

  • Data Frame
  • List
  • Vector
list_data2 <- jsonlite::fromJSON(json_data2)
kable(as.data.frame(list_data2))
groceries
bread
pack of eggs
apples
bottle of milk
list_data2$groceries
[1] "bread"          "pack of eggs"   "apples"         "bottle of milk"

3.2 Structure as an HTML list

To further structure the output as an HTML list (e.g., <ul></ul>), we can construct a simple parser in Python and render it in R like this:

import json
def render(data: str):
    data_dict = json.loads(data)
    backslash ="\n\t"

    return f"""
    <ul>
        {"".join([ f"<li>{x}</li>" for x in data_dict["groceries"]])}
    </ul>
    """
library(htmltools)
HTML(py$render(json_data2))
  • bread
  • pack of eggs
  • apples
  • bottle of milk

3.3 Utilize Schema to Define Data Shape

The problem is that we haven’t defined the data shape; let’s call it the JSON schema. The schema is currently up to the LLM, and it might change based on user input. Let’s rephrase the user query to see this in action. Instead of asking, “I’d like to buy some bread, a pack of eggs, a few apples, and a bottle of milk,” let’s ask, “12 eggs, 2 bottles of milk, 6 sparkling waters.”

message = "12 eggs, 2 bottles of milk, 6 sparkling waters"

res = py$llm_eval2(prompt=prompt, message=message)

json_data = res$choices[[1]]$message$content

cat(json_data)
{
  "groceries": [
    {
      "item": "eggs",
      "quantity": 12
    },
    {
      "item": "bottles of milk",
      "quantity": 2
    },
    {
      "item": "sparkling waters",
      "quantity": 6
    }
  ]
}

As seen in the output, ChatGPT can incorporate quantities from the input into the output by adding an extra quantity attribute.

OpenAI has since introduced Structured Outputs as the next step. You can get the same results using response_format={"type": "json_object"} and parsing the data yourself, without relying on the beta version of the API (which, according to the original post, was not performing reliably).
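For reference, a hedged sketch of a Structured Outputs variant of llm_eval2() might look like the following (the schema name and shape here are illustrative, not taken from the post above; check OpenAI’s Structured Outputs documentation for the exact format):

# Hypothetical sketch: enforce a JSON schema via Structured Outputs.
import os
from openai import OpenAI

def llm_eval_schema(prompt: str, message: str, model: str = "gpt-4o"):
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": message},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "grocery_list",  # illustrative schema name
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "groceries": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "quantity": {"type": "number"},
                                },
                                "required": ["name", "quantity"],
                                "additionalProperties": False,
                            },
                        }
                    },
                    "required": ["groceries"],
                    "additionalProperties": False,
                },
            },
        },
    )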

For the following experiment, let’s create two prompts (No schema vs. With schema) and two inputs (Vague quantity vs. Precise quantity), and let the LLM generate outputs for these four conditions:

prompt_old = "
You are data parsing assistant. 
User provides a list of groceries. 
Your goal is to output is as JSON.
"

prompt_schema = "
You are data parsing assistant. 
User provides a list of groceries. 
Use the following JSON schema to generate your response:

{{
    'groceries': [
        { 'name': ITEM_NAME, 'quantity': ITEM_QUANTITY }
    ]
}}

Name is any string, quantity is a numerical value.
"

vague_input <- "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk."
precise_input <- "12 eggs, 2 bottles of milk, 6 sparkling waters."

tabset_condition <- apply(expand.grid(c("No_Schema", "With_Schema"), c("Vague_Input", "Precise_Input")), 1, \(x) paste0(x[1], "+", x[2]))

dat_prompt_input <- data.frame(
  expand.grid(
    prompts = c(prompt_old, prompt_schema),
    inputs = c(vague_input, precise_input), stringsAsFactors = FALSE
  )
  
)

dat_output <- dat_prompt_input |> 
  rowwise() |> 
  mutate(
    json_data =get_response(prompt = prompts, message = inputs) 
  )
prompt = """
You are data parsing assistant. 
User provides a list of groceries. 
Use the following JSON schema to generate your response:

{{
    "groceries": [
        { "name": ITEM_NAME, "quantity": ITEM_QUANTITY }
    ]
}}

Name is any string, quantity is a numerical value.
"""

inputs = [
    "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk.",
    "12 eggs, 2 bottles of milk, 6 sparkling waters.",
]

for message in inputs:
    res = llm_eval2(prompt=prompt, message=message)
    json_data = res.choices[0].message.content
    print(json_data)

The outputs for the four conditions show that the combination of the prompt with a schema and the input with precise quantities yields the clearest LLM output.

Function to convert JSON to Markdown Table
json_to_markdownTBL <- function(x) {
  dat <- as.data.frame(fromJSON(x)) # convert JSON into data.frame
  md_dat <- paste0(kableExtra::kable(dat, format = "simple"), collapse ="\n")
  return(md_dat)
}
data_frame_output <- sapply(dat_output$json_data, json_to_markdownTBL) |> 
  unlist()
Compile pieces of information into the output
content <- paste0(glue("
### {tabset_condition}
**Prompt**:\n{dat_output$prompts}\n\n
**Input**:\n\n{dat_output$inputs}\n\n
**LLMOutput**:\n\n{dat_output$json_data}\n\n
**R DataFrame**:\n\n{data_frame_output}\n\n
"), collapse = "")
glue("::: panel-tabset\n{content}:::")

No_Schema+Vague_Input

Prompt:

You are data parsing assistant. User provides a list of groceries. Your goal is to output is as JSON.

Input:

I’d like to buy some bread, pack of eggs, few apples, and a bottle of milk.

LLMOutput:

{ “groceries”: [ “bread”, “pack of eggs”, “few apples”, “bottle of milk” ] }

R DataFrame:

groceries
bread
pack of eggs
few apples
bottle of milk

With_Schema+Vague_Input

Prompt:

You are data parsing assistant. User provides a list of groceries. Use the following JSON schema to generate your response:

{{ ‘groceries’: [ { ‘name’: ITEM_NAME, ‘quantity’: ITEM_QUANTITY } ] }}

Name is any string, quantity is a numerical value.

Input:

I’d like to buy some bread, pack of eggs, few apples, and a bottle of milk.

LLMOutput:

{ “groceries”: [ { “name”: “bread”, “quantity”: 1 }, { “name”: “pack of eggs”, “quantity”: 1 }, { “name”: “apples”, “quantity”: 3 }, { “name”: “bottle of milk”, “quantity”: 1 } ] }

R DataFrame:

groceries.name groceries.quantity
bread 1
pack of eggs 1
apples 3
bottle of milk 1

No_Schema+Precise_Input

Prompt:

You are data parsing assistant. User provides a list of groceries. Your goal is to output is as JSON.

Input:

12 eggs, 2 bottles of milk, 6 sparkling waters.

LLMOutput:

{ “groceries”: [ { “item”: “eggs”, “quantity”: 12, “unit”: “pieces” }, { “item”: “milk”, “quantity”: 2, “unit”: “bottles” }, { “item”: “sparkling water”, “quantity”: 6, “unit”: “bottles” } ] }

R DataFrame:

groceries.item groceries.quantity groceries.unit
eggs 12 pieces
milk 2 bottles
sparkling water 6 bottles

With_Schema+Precise_Input

Prompt:

You are data parsing assistant. User provides a list of groceries. Use the following JSON schema to generate your response:

{{ ‘groceries’: [ { ‘name’: ITEM_NAME, ‘quantity’: ITEM_QUANTITY } ] }}

Name is any string, quantity is a numerical value.

Input:

12 eggs, 2 bottles of milk, 6 sparkling waters.

LLMOutput:

{ “groceries”: [ { “name”: “eggs”, “quantity”: 12 }, { “name”: “bottles of milk”, “quantity”: 2 }, { “name”: “sparkling waters”, “quantity”: 6 } ] }

R DataFrame:

groceries.name groceries.quantity
eggs 12
bottles of milk 2
sparkling waters 6

4 Resources


Citation

BibTeX citation:
@online{zhang2025,
  author = {Zhang, Jihong},
  title = {LLM {Cookbook} for {Psychometricians}},
  date = {2025-03-07},
  url = {https://www.jihongzhang.org/posts/2025/2025-04-03-Large-Language-Model-Psychometrics/content/chatgpt.html},
  langid = {en}
}
For attribution, please cite this work as:
Zhang, J. (2025, March 7). LLM Cookbook for Psychometricians. https://www.jihongzhang.org/posts/2025/2025-04-03-Large-Language-Model-Psychometrics/content/chatgpt.html