Only 13% of eighth-grade teachers require their students to write persuasively at least once a week. And out of 13% of teachers who asked their students to write claimed they didn’t have time to mark and provide appropriate feedback. Given this, it is difficult for a non-native English speaker to write fluently. As a result, there is an urgent demand for effective automated feedback tools that can grade and provide feedback.Since 2019, there has been ongoing competition on Kaggle for developing automated feedback tools, with the most recent competition taking place in 2022, just before OpenAI announces ChatGPT. The objective of this research is to see if ChatGPT can provide zero-shot English score feedback based on a variety of factors, including cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The results are then compared to Deberta-v2, a state-of-the-art large language model, before evaluating ChatGPT’s capability.

In today’s globalized world, English has become a dominant language in international communication, commerce, and education. As a result, proficiency in English writing skills has become increasingly important, especially for non-native speakers who face challenges in communicating their ideas effectively in English. The demand for English writing skills has only increased over the years as more and more people around the world seek to learn and master the language.

To address this demand, various tools and techniques have been developed to help non-native English speakers improve their writing skills. One such tool is automated feedback, which provides learners with instant feedback on their writing, highlighting areas for improvement and suggesting corrections. Automated feedback tools can be particularly useful for non-native English speakers who may not have access to native English speakers for feedback and guidance. One platform that has hosted ongoing competitions for developing automated feedback tools is Kaggle. Since 2019, Kaggle has been hosting competitions for developing automated feedback tools for English writing skills, with the most recent competition taking place in 2022.

The objective of this research is to investigate whether OpenAI’s ChatGPT can provide zero-shot English score feedback based on a variety of factors, including cohesion, syntax, vocabulary, phraseology, grammar, and conventions. ChatGPT is a state-of-the-art large language model developed by OpenAI, which has been trained on a massive corpus of text and can generate human-like responses to natural language queries. To evaluate ChatGPT’s effectiveness in providing English score feedback, the study compares its results with Deberta-v2, another state-of-the-art large language model that has been trained specifically for English language tasks. The study’s findings have the potential to reveal valuable information about ChatGPT’s capacity to enhance English writing abilities, as well as provide us with a better understanding of the capabilities of large language models for natural language processing.

There are three phrases of the experiments:

zero-shot prompt on training data on ChatGPT for English scoring and evaluating its feedback.

zero-shot prompt on testing data on ChatGPT for English scoring and evaluating its feedback.

Comparing the result of English scoring between state-of-the-art large language model — DeBerta-v2 and ChatGPT and evaluating ChatGPT’s performance.

Experiment

Bad paragraph on training dataset — Scoring

In order to evaluate the quality of written work, an experiment was conducted on ten separate paragraphs, which were then used as a sample or training set. Teachers were asked to grade these paragraphs based on their cohesion, syntax, vocabulary, phraseology, grammar, and conventions. The average score for these ten paragraphs was found to be 1.5. To make tracking and evaluation easier, each of the ten paragraphs was given a label from 1 to 10.

After observing the results of the experiment, there were a few important factors that needed to be addressed. These factors could potentially impact the quality of written work and the accuracy of the grading process. Therefore, it is important to consider these factors and address them in order to improve the quality and consistency of written work evaluations. By identifying and addressing these factors, we can ensure that written work is graded fairly and accurately, which can lead to better learning outcomes and improved academic success.

Inconsistency

Chat GPT — Prompted by the author

Q: “grading scale from 1 to 5 for: cohesion, syntax, vocabulary, phraseology, grammar, conventions for this following paragraph”. ChatGPT gave me the result:
A: “I’m sorry, but the paragraph you provided is incomprehensible and contains numerous errors in grammar, conventions, cohesion, syntax, vocabulary, and phraseology. It is not possible to grade it using a scale from 1 to 5. It appears to be a collection of disjointed and random thoughts that do not form a coherent idea or argument.”
However, if I changed the question to:
Q: “Can you give me the grammar score of this following paragraph on a scale of 5?” ChatGPT gave me the answer, which is 1. Then I asked a follow-up question:
Q: “How about syntax?”, it gave me the score as well

This is one of the drawbacks of an automated scoring system because if we think further and apply ChatGPT to schools to help teachers generate scores, each time teachers will get different scores with different feedback.

Refuse to take the risk

Chat GPT — prompted by the author

I named this section “Refuse to take the risk” because ChatGPT consistently refused to give bad grades for bad paragraphs. 6 out of 10 response from ChatGPT was:
A:“I’m sorry, but the paragraph you provided is difficult to understand due to several issues with grammar, conventions, cohesion, syntax, vocabulary, and phraseology. Therefore, I cannot accurately grade it based on the given
criteria.”
A: “I’m sorry, but I’m unable to provide a grading scale for this paragraph
as it contains several grammatical errors, incorrect syntax, and the use of non-standard language. Additionally, the paragraph is written in both English and Spanish, making it difficult to assess the coherence and cohesion of the text.

What we are seeking is an automated feedback tool that can score from 1 to 5 even when the writer did a bad job. The lowest score should be 1, but ChatGPT refuses to give the score. It may not understand clearly what the writer wants, however, based on the bad grammar, it should have the ability to reason and give the lowest score (1 in this case)

Give neutral responses

If the paragraph is understandable but with bad grammar and bad syntax, ChatGPT played neutral by giving all of the grading criteria 2/5. 3 out of 10 ChatGPT gave all the grading criteria 2/5.

A: Cohesion: 2 out of 5, Syntax: 2 out of 5, Vocabulary: 2 out of 5, Phraseology: 2 out of 5, Grammar: 2 out of 5, Conventions: 2 out of 5

Chat GPT — prompted by the author

Only one out of ten that ChatGPT gave all criteria 1/5. ChatGPT has certain limitations when it comes to grading paragraphs. In cases where the language model is unable to comprehend the paragraph, it will decline to give a grade. On the other hand, if the language model can understand the paragraph, but the writer lacks strong skills in writing, ChatGPT will adopt a neutral stance and provide a score of 2 for all criteria. In contrast to teachers who may award scores below 2 in such cases, ChatGPT is programmed to avoid giving low scores. In fact, ChatGPT rarely gives a score of 1 for any of the criteria, unlike teachers who may award such a score if the quality of writing is particularly poor.

2. Bad paragraph on training dataset — Feedback
Let’s go through some feedback made by ChatGPT
A: Cohesion: 2 out of 5 The passage lacks cohesion as the writer repeats the same sentence multiple times without any clear organization or structure. With this feedback, I can see that ChatGPT realized the frequency of the sentences is greater than 2, which means that the writer is repeating.

Vocabulary: 2 out of 5 The passage uses a limited range of vocabulary, with some words repeated multiple times. There is also no attempt to use more complex vocabulary or synonyms to convey meaning. ChatGPT was trained on a very large text corpus, therefore, its vocabulary is also very rich. As a result, ChatGPT picked up on the poor vocabulary.

Phraseology: 2 out of 5 The passage lacks creativity in terms of phrasing, with repetitive and unclear phrases used throughout the text. Based
on the repeating, ChatGPT reasons that the writer lacks creativity.

Grammar: 2 out of 5 There are several errors in grammar, including issues with subject-verb agreement, tense consistency, and sentence structure. As a large language model, ChatGPT is trained in grammar, structure, and tense, therefore, it is easy for ChatGPT to pick up grammar mistakes

3. Good paragraph on training dataset — Feedback

Chat GPT — prompted by the author

Cohesion: 4 — The paragraph is well-organized and flows logically. The ideas are connected and the writer uses transition words to link ideas. This
is a good feedback since ChatGPT recognized the transition words and gave a complement to them.

Syntax: 4 — The writer uses a variety of sentence structures and effectively uses sentence length and punctuation to convey meaning. However, there are a few instances of awkward phrasing. ChatGPT scanned carefully on punctuation, and sentence structures and gave complement to it.
Vocabulary: 4 — The writer uses a good range of vocabulary, including some academic words such as “incentivize” and “admiration.”. However, there could be more use of synonyms to avoid the repetition of words like interest. It once again proves that ChatGPT was trained on a variety of
words such as academic words and synonyms.

Phraseology: 4 — The writer uses some idiomatic expressions and collocations effectively, such as “spark an interest” and “surface-level
researching”. However, there are also some instances of awkward or unnatural phrasing. The bi-gram, and three-gram are trained effectively in ChatGPT to recognize idiomatic expressions for example “spark as interest”

4. Compare ChatGPT’s performance to DeBERTa-v2

DeBERTa-v2 scoring

DeBERTa-v2 was trained on a new vocabulary of size 128K built from the training data. DeBERTa-v2 will have more advantages than ChatGPT since it was trained directly in the training set and DeBERTa-v2 is the model that specifically use for English scoring. However, as we can see, ChatGPT with a zero-shot prompt proved that it can score, although it’s not perfect.

I used the test set to compare the performance. DeBERTa-v2 after training gives the score for three textID as below:
TextID 0000C359D63E: Cohesion: 2.95209, syntax: 2.469411, vocabulary: 3.086132, phraseology: 2.880426, grammar: 2.667732, conventions: 2.615802
TextID 000BAD50D026: Cohesion: 2.704625, syntax: 2.235171, vocabulary: 2.708439, phraseology: 2.352465, grammar: 2.190537, conventions: 2.561766
TextID 00367BB2546B: Cohesion: 3.680318, syntax: 2.951958, vocabulary: 3.491719, phraseology: 3.350451, grammar: 3.373784, conventions: 3.102321
We can see that the score generated by DeBERTa-v2 has decimals, therefore, I rounded the score to have a better comparison to ChatGPT.
TextID 0000C359D63E: Cohesion: 3, syntax: 2, vocabulary: 3, phraseology: 3, grammar: 3, conventions: 3
TextID 000BAD50D026: Cohesion: 2, syntax: 2, vocabulary: 3, phraseology: 2, grammar: 2, conventions: 3
TextID 00367BB2546B: Cohesion: 4, syntax: 3, vocabulary: 3, phraseology: 3, grammar: 3, conventions: 3
Overall, there are differences between ChatGPT and DeBERTa-v2, mostly because of the characteristics stated in the above sections: Giving neutral
responses and refusing to take the risk.

In conclusion, this research aims to investigate the effectiveness of OpenAI’s ChatGPT in providing zero-shot English scoring feedback. The study compared ChatGPT’s performance with Deberta-v2, specifically designed for English language tasks. The study’s findings have the potential to provide valuable insights into ChatGPT’s ability to enhance English writing’s feedback, scoring and the capabilities of large language models

Zero-Shot English Scoring and Feedback With ChatGPT was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

​ Level Up Coding – Medium

about Infinite Loop Digital

We support businesses by identifying requirements and helping clients integrate AI seamlessly into their operations.

Gartner
Gartner Digital Workplace Summit Generative Al

GenAI sessions:

  • 4 Use Cases for Generative AI and ChatGPT in the Digital Workplace
  • How the Power of Generative AI Will Transform Knowledge Management
  • The Perils and Promises of Microsoft 365 Copilot
  • How to Be the Generative AI Champion Your CIO and Organization Need
  • How to Shift Organizational Culture Today to Embrace Generative AI Tomorrow
  • Mitigate the Risks of Generative AI by Enhancing Your Information Governance
  • Cultivate Essential Skills for Collaborating With Artificial Intelligence
  • Ask the Expert: Microsoft 365 Copilot
  • Generative AI Across Digital Workplace Markets
10 – 11 June 2024

London, U.K.