Automated Python Marking
- Providing support
- Marking – simplemarker
- The process – benefits and problems
- Example of student errors
- Examples of students’ inefficiencies
- Lent term additions
- The future
- Available products for marking
Each year 300+ new students learn Python using a tutorial (Jupyter notebooks). They need to complete exercises (also Jupyter notebooks) using a cloud facility like Google Colaboratory. In October 2020 we attempted to mark these exercises using automation. This replaced 75 expensive and tiring person-hours of face-to-face interaction. As an extra challenge we didn’t change the exercises (many of which used asserts anyway). The first batch of 10,000+ programs was non-graphical, which helped. This document reports on the results. Marking needs to be part of a wider framework, so help provision is considered here too.
Providing support
In the past, support was an online Forum and a daily face-to-face drop-in help desk, neither of which was used much (especially given the size of the cohort). Marking was done face-to-face. Covid forced us to reconsider these procedures. We decided to advertise to students a range of methods of getting help –
- You may well need to re-read the provided notebooks several times. Everything you need to know is there. If you rush too quickly to the Exercises you might be able to muddle through the early ones, but sooner or later you’ll get stuck.
- The “Ask a Question” Forum on the Moodle page is open all the time. Anybody can answer. Posts can lead to a Zoom dialog if you want.
- The “Support Chatrooms” on the Moodle page are open 2-3pm. One helper per chatroom. Posts can lead to a Zoom dialog if you want.
- Other tutorials – one reason we teach Python is that there are many online tutorials to suit all tastes. The main Python site has a list for beginners on https://wiki.python.org/moin/BeginnersGuide/Programmers
- The University provides general and specialist courses – see https://training.cam.ac.uk/ucs/search?type=events&query=python
- Online help – If the local Forum doesn’t suit you, try https://stackoverflow.com/questions/tagged/python-3.x. If you don’t know what an error message (e.g. “‘NoneType’ object is not subscriptable”) means, just copy/paste it into a search engine. Read How do I ask a good question?
Virtual support seemed to work ok (Mich help-support costs were 30% of last year’s – even so, the helpers during the 2-3pm sessions often had nothing to do for entire sessions). 85 questions were asked on the Forum in 6 weeks during Mich 2020. Students liked sending screendumps rather than code. Moodle offers facilities to run online helpdesks. Marking was a trickier issue.
simplemarker
Many automarkers (some commercial) are available – see the list below. Many are language neutral. Most compare the program output with expected output (sometimes demanding an exact match). Some integrate with databases.
Merely comparing program output with expected output restricts the type of questions that can be asked, so we wrote our own simplemarker program. Given a folder of students’ replies to a programming exercise, and a list of tests, simplemarker
- checks to see if the programs ask for user input
- checks for infinite loops
- checks to see if the programs pass a set of tests
- optionally checks for duplicate replies
- returns lists of files that pass, etc.
Here’s the code fragment for marking the solutions to the exercise about writing an “is_odd” function
testisodd1={"FunctionExists": {"FunctionName":"is_odd"}}
testisodd2={"CheckReturnValue": {"FunctionName":"is_odd","Input":3,"Returns":True}}
testisodd3={"CheckReturnValue": {"FunctionName":"is_odd","Input":4,"Returns":False}}
d=marking.simplemarker("testfolder", [testisodd1,testisodd2,testisodd3],label="Mark is_odd",hashcheck=True)
printanswers(d)
and here’s the output
*** Mark is_odd
PASS - ['abc123.py', 'tl136.py']
FAIL - ['sms67.py']
ASKS FOR USER INPUT - ['xxx.py']
INFINITELOOP - ['yyy.py']
HASH - b5db5badc627e448e07616fcd606064a 2
(the last line means that 2 solutions have the same checksum, which would be suspicious with a large number of programs)
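simplemarker’s internals aren’t reproduced in this document, but the duplicate check needs nothing more than a checksum per file. A minimal sketch of the idea (my illustration, not the actual implementation) –

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(folder):
    """Group the .py files in a folder by an MD5 digest of their bytes.
    Any digest shared by two or more files deserves a closer look."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.py")):
        groups[hashlib.md5(path.read_bytes()).hexdigest()].append(path.name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}

# find_duplicates("testfolder") -> e.g. {'b5db5bad...': ['aa123.py', 'bb456.py']}
```

A byte-level checksum only catches identical copies, of course – renaming one variable or changing whitespace defeats it.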
The tests available are
- Does a named function exist?
- Does a named function (or the whole file) contain a particular string or regex?
- Does a named function (or the whole file) have fewer than (or more than) a specified number of lines?
- Does the output of a file have more than (or fewer than) the specified number of lines?
- Is the named function recursive? (a sketch of how such a check can work follows this list)
- Given the specified input, does a named function return the specified value?
- Does the file print the specified output?
- Does a new file which contains a named function and some specified extra lines produce the specified output when run?
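To give a flavour of what such a test involves, here is one way (my sketch, not the simplemarker source) of checking for recursion without running any student code, by walking the abstract syntax tree of the submission –

```python
import ast

def calls_itself(source, function_name):
    """True if the named function contains a direct call to itself."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == function_name):
                    return True
    return False

student = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n"
print(calls_itself(student, "factorial"))   # True
```

A submission that iterates, or that just calls math.factorial, fails this check – which is exactly the mistake reported in the statistics below.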
The aim wasn’t complete automation in the first year, but to reduce the need for human interaction, using the real-life sample set to refine the simplemarker program.
The process – benefits and problems
Students submit a zip file of 6 Jupyter notebooks via Moodle (our CMS). 100 lines of bash/Python code extract the Python programs and generated images from these, putting them into an appropriate folder (there’s one for each exercise) and reporting on missing programs. 1000 more lines of Python code (still under development) do the marking, producing a “Right” or “Wrong” result for each file, and sometimes diagnosing the problem. At the end there’s a list of students/programs that have failed.
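The extraction scripts aren’t listed here, but because a notebook is just JSON, the heart of the job is short. A sketch of the idea (the file names are invented, and the real scripts keep each cell separate – one folder per cell – rather than concatenating them) –

```python
import json
from pathlib import Path

def notebook_to_py(notebook_path, out_path):
    """Write the concatenated code cells of a Jupyter notebook to a .py file."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    code_cells = ["".join(cell["source"])            # "source" is a list of lines
                  for cell in nb["cells"]
                  if cell["cell_type"] == "code"]
    Path(out_path).write_text("\n\n".join(code_cells), encoding="utf-8")

# e.g. notebook_to_py("tl136/07 Exercises.ipynb", "exercise07/tl136.py")
```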
The process for milestone 2, Mich 2020 was
- In Moodle, choose the “Download all submissions” “Grading action”
- Unzip these into an empty folder – I did
mkdir /tmp/marking
cd /tmp/marking
unzip ~/Downloads/*notebooks*7-12*.zip
mv */*zip .
rm -rf *_file_
- Remove already-marked submissions
bash ~/python-ia-marking/deletemarkedsubmissions
- Create a folder for each exercise (actually, each cell of an exercise). Extract the programs and png files (the png files are the program’s graphical output) –
bash ~/python-ia-marking/extractallmilestone2
In each folder there’s a CRSID.py file for each student, and a CRSID.png file if graphics were produced
- Mark –
python ~/python-ia-marking/do_the_markingMilestone2.py
This outputs lists of CRSIDs of passes and failures for each question. For milestone 2 on an old laptop it takes about 20 minutes.
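do_the_markingMilestone2.py isn’t reproduced here, but roughly speaking it walks the per-exercise folders and applies the relevant tests to each with simplemarker. A schematic sketch (the folder names and test lists are invented examples; only the simplemarker call follows the fragment shown earlier) –

```python
import marking      # local module providing simplemarker, as in the fragment above

# One entry per exercise folder; these particular tests are illustrative only.
exercises = {
    "is_odd": [
        {"FunctionExists": {"FunctionName": "is_odd"}},
        {"CheckReturnValue": {"FunctionName": "is_odd", "Input": 3, "Returns": True}},
        {"CheckReturnValue": {"FunctionName": "is_odd", "Input": 4, "Returns": False}},
    ],
    "factorial": [
        {"FunctionExists": {"FunctionName": "factorial"}},
        {"CheckReturnValue": {"FunctionName": "factorial", "Input": 5, "Returns": 120}},
    ],
}

for folder, tests in exercises.items():
    d = marking.simplemarker(folder, tests, label=folder, hashcheck=True)
    printanswers(d)        # the same reporting helper used in the earlier fragment
```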
Benefits
- More rigorous, consistent checking of straightforward issues.
- The ability to check for cheating
- Gathering statistics to identify what students struggle with –
- Well over 10% of students fail to extract the 3rd column of a matrix using slicing (they extract the wrong column, a row, or list each element – see the example at the end of this list). Anything not tested by asserts is likely to have a rather high fail rate.
- Over 20% of students fail to use recursion when asked to write a recursive factorial function – they iterate (thinking this is recursion) or call math.factorial.
- For milestone 1 in Mich 2020, 294 corrections were requested from 199 students. Many re-corrections were requested too.
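The slicing mistake is easy to picture. With a numpy array (mine, not the one in the exercise) the intended answer and the common wrong answers differ by a single index –

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

col = A[:, 2]     # intended: the 3rd column -> [ 3  7 11]
row = A[2]        # common mistake: the 3rd row -> [ 9 10 11 12]
off = A[:, 3]     # common mistake: the 4th column (forgetting 0-based indexing)
print(col, row, off)
```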
Problems
- Submission trouble – The instructions were “The single zip file you submit should have your ID, a full-stop, and a “zip” suffix in its name and nothing else (e.g. tl136.zip). It should contain only the 6 notebook files, the files having the suffix ipynb“. About 5% of students submit wrongly – multiple files instead of a single file, a wrongly named single file, pdf files archived, suffixes removed from all the archived files, East Asian characters added to the suffixes of archived files, zip files archived, etc.
- Remoteness – When the course was C++ and lab-based (timetabled to run twice a week), it was possible to check that students were keeping up (marking was done in each session). Struggling students could be given extra help, often benefitting from long 1-to-1 discussions. When the course became Python and self-taught in 2016, students who self-identified as weak could (and did) regularly visit the daily helpdesks and get individual tuition. Of course, unconfident students could still hide away, but the opportunities were there. Covid has increased the difficulties that staff have identifying and contacting the students who most need help.
- Pedantic automation – I soon realised during the trial run that automation would reject many programs that human markers would accept. When asked to write a function called area, many students wrote Area, triangle_area, etc. When asked to print a value out, they only calculated it. Pedantry is part of computing, but it can detract from the learning experience. I told students that automated marking will increase, so they’ll need to get used to it.
- Expert programmers need more feedback – Having spent time doing the optional sections and finding ingeniously short answers, students want more than a “No news is good news” or “Congratulations! You’ve passed” response.
- Poor programmers need more feedback – Telling students that they need to re-do a question is frequently unproductive. As is telling them to read the question more carefully (see the next section).
- Students finish the exercises without trying to understand the material – This common problem is made worse by the remote learning context. It’s not unusual for students to bypass much of the teaching material, starting by looking at the exercises (which are in separate documents from the tutorials). Only if they get stuck will they refer back (very selectively) to the tutorial. Consequently when they are told why their program fails they can’t correct it because they don’t understand the helper’s description of the mistake. They don’t know what recursion or dictionaries are (one student asked me if a vertical dictionary was ok). When asked to create a list, they use the common meaning of “list” rather than the Python concept. They don’t understand the difference between a function printing the answer and returning the answer (after all, both appear on screen). I tell them to read the teaching material and ask questions about that before attempting the exercises. Face-to-face interaction is needed at this point.
In Mich 2020 milestone 1 I used automation to filter out correct entries, looking at the files flagged as wrong. When I mailed students for corrections I always explained briefly what was wrong. All the students who failed to use recursion when asked to received the same bulk-mailed message. In most other cases I needed to send different messages to subsets of students whose solution failed.
I invited students to mail me or come to the daily helpdesk if they wanted feedback.
In Mich 2020 milestone 2, things were smoother. And I realised that graphical output could be extracted from notebooks.
Example of student errors
Getting students to correct their code isn’t a trivial matter. Students are asked to “Write a function that uses list indexing to add two vectors of arbitrary length, and returns the new vector. Include a check that the vector sizes match, and print a warning message if there is a size mismatch.” They’re given some code to start –
def sum_vector(x, y):
    "Return sum of two vectors"

a = [0, 4.3, -5, 7]
b = [-2, 7, -15, 1]
c = sum_vector(a, b)
Over 20% of students failed this exercise. It (and a question about dictionary manipulation) was the most re-corrected exercise. Several things (many not tested by asserts) can go wrong –
- They forget entirely to check vector lengths
- They check vector lengths, but outside of the function (they check the lengths of a and b before calling sum_vector). Explaining why this is bad can take a while
- They overwrite the given x and y values in the function – e.g. x = [0, 4.3, -5, 7], etc. (perhaps because they have to use variables called a and b when calling the function, but the function wants variables called x and y).
- They check vector lengths in the function, in the loop they use to sum the vectors
- They check vector lengths in the function, but only after they’ve tried to sum the vectors
- They check vector lengths in the function before they sum the vectors, and they print a message if there’s a mismatch, but they don’t return, so the program crashes anyway
- They check vector lengths in the function at the right time, bypassing the maths if there’s a mismatch but forgetting to print a message
- If there’s a size mismatch they return a string containing the warning message
- If there’s a size mismatch they return print('Size mismatch') without appreciating what they’re returning.
I’ve had situations where, when told about the first error listed here, they make the second error. When told about the second error, they make another one. And so on! They’re blindly jumping through hoops.
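For comparison, here is a minimal answer of the kind that passes. The exact wording of the warning and the behaviour on mismatch (returning None) are my own choices – the exercise doesn’t prescribe them –

```python
def sum_vector(x, y):
    "Return sum of two vectors"
    if len(x) != len(y):                  # check sizes before doing any arithmetic
        print("Warning: vector sizes do not match")
        return None                       # bail out so nothing below can fail
    return [x[i] + y[i] for i in range(len(x))]

a = [0, 4.3, -5, 7]
b = [-2, 7, -15, 1]
c = sum_vector(a, b)                      # [-2, 11.3, -20, 8]
```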
Examples of students’ inefficiencies
Some long-winded ways of solving problems aren’t bugs, but a human, face-to-face marker would fix them on the spot. Here are some examples –
- String repetition – a question begins by giving them a passage from Shakespeare, asking them to produce a string that repeats it 100 times. A few students use copy/paste
- Dice outcomes – a question asks them to repeatedly roll a simulated die and collect results. Instead of using “frequency[outcome] += 1” several students try something like
  if outcome == 1:
      no1 += 1
  elif outcome == 2:
      no2 += 1
  elif outcome == 3:
      no3 += 1
  elif outcome == 4:
  Fortunately it’s only a 6-sided die.
- String comparison – a question asks them to write a __lt__ method. This involves string comparison. Instead of using self.surname < other.surname, dozens of students write something like
  if len(self.surname) < len(other.surname):
      for i in range(len(self.surname)):
          if alphabet.index(self.surname.upper()[i]) < alphabet.index(other.surname.upper()[i]):
              return True
          elif alphabet.index(self.surname.upper()[i]) > alphabet.index(other.surname.upper()[i]):
              return False
  etc., etc.
I don’t think a programming course should let these pass. We could identify at least some of these by checking on the program/function length, but human follow-up is required.
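The length check is mechanical. One way to do it (my sketch, not simplemarker’s code) is again via the abstract syntax tree, which records the line span of every function –

```python
import ast

def function_line_count(source, function_name):
    """Number of source lines spanned by the named function (None if absent)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            return node.end_lineno - node.lineno + 1
    return None

code = "def __lt__(self, other):\n    return self.surname < other.surname\n"
print(function_line_count(code, "__lt__"))   # 2 - a 20-line __lt__ deserves a human look
```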
Lent term additions
- counttests added – a bash script to count pytests
- gitloganalysis added – a bash script to check who made commits
- marking.linesofoutputcheck added
I ran the marking program and the programs themselves using a separate test account with a non-writeable Python distribution. A bash script unzipped folders and ran
- counttests – to see if a reasonable number of tests existed
- gitloganalysis – if .git existed, this gave a list of how many times each person committed
then ran a marking program. More checks were made to see whether non-mandatory but useful features were used. This helped to make feedback to students more flexible. For example, I checked to see whether the “legend” command was used, and wrote “Legends on the graphs would have helped” if it wasn’t.
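That legend check is just a text search over the extracted .py files – something like this (my sketch; the helper and folder name are invented, the message is the one quoted above) –

```python
from pathlib import Path

def students_without_legends(folder):
    """CRSIDs of extracted programs that never call legend()."""
    return sorted(path.stem
                  for path in Path(folder).glob("*.py")
                  if ".legend(" not in path.read_text(encoding="utf-8"))

# used when building the feedback messages
for crsid in students_without_legends("exercise_plotting"):
    print(crsid, "- Legends on the graphs would have helped")
```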
The future
- automarker changes –
  - Diagnosing common bugs so that students can be told more precisely what’s wrong with their code. By anticipating and testing for common bugs (rather in the way that multiple choice options anticipate common mistakes) perhaps better diagnosis is possible. I started doing this for milestone 2 when I realised that students were often summing the 3rd row of a matrix rather than the 3rd column (a sketch of this idea follows the list).
  - Diagnosing common style issues.
  - Adding mail=yes/no, from=emailaddress, message=”” fields for each test to facilitate automation of mailed feedback.
  - Mailing students about all their mistakes in one go, rather than mailing students question by question.
  - Listing students who did the optional questions, so they can be praised and encouraged.
- Some changes in the questions’ wording, and greater ruthlessness in rejecting invalid submissions, would save a lot of time.
  - When a question asks them to produce and print a list, they’re likely to print a sequence of items (i.e. a list in the everyday sense rather than a Python list). Maybe we could tell them the name that the list should be given.
  - Some exercises look optional to some students though they’re compulsory. Exercise 06.3 only has an editable cell under the optional part, so they didn’t do the first part (or they made a new cell for it). But all the questions have this layout with cells at the end, so I don’t know why 06.3 is a problem.
  - Some exercises (e.g. 05.2) encourage the use of multiple cells to write the code in. Students used 1-4 of the cells. One cell would be easier for the automarker.
  - Maybe more of the open-ended parts of questions could become optional – in 05.1 for example.
- An interactive system? – students could drop files into a webpage and get instant feedback.
- Automarking needs to be part of an integrated support environment. Perhaps automarking should be a first phase dealing with the bulk of submissions, then humans should deal with the more problematic cases. Demonstrators could deal with students who did the optional extras, or who wanted more feedback on style, etc.
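A possible shape for the “anticipated common bugs” diagnosis mentioned in the first of the automarker changes above: as well as testing against the right answer, test against the wrong answers that known mistakes produce, and attach a targeted message to each. A sketch with an invented column-summing exercise and made-up message text –

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

expected = A[:, 2].sum()      # sum of the 3rd column = 21

# Wrong answers that common, anticipated mistakes would produce.
anticipated_bugs = {
    A[2].sum():    "you've summed the 3rd row rather than the 3rd column",
    A[:, 3].sum(): "that's the 4th column - remember Python indices start at 0",
}

def diagnose(student_answer):
    """Return None if correct, a targeted message if a known bug, else a generic one."""
    if student_answer == expected:
        return None
    return anticipated_bugs.get(student_answer,
                                "wrong answer - please re-read the question")

print(diagnose(A[2].sum()))   # "you've summed the 3rd row rather than the 3rd column"
```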
Available products for marking
- mOCSIDE (you need to be logged into github before following the link). Uses Docker to run the code. The report in a 2022 “Computing Sciences in Colleges” issue mentions Web-cat, OK and Autolab as three of the more prominent free systems, and zyLabs, zyBooks, Coding Room as non-free options.
- Coderunner – Moodle plug-in
- Virtual Programming Lab – A Moodle plug-in designed to provide feedback as part of a MOOC. A report is on-line
- https://github.com/marovira/marking – (Developed for first year courses at UVic). Any deviation from the expected output (whitespace, spelling, capitalization, new lines, etc) will be flagged as an error
- http://web-cat.org/ – “It is highly customizable and extensible, and supports virtually any model of program grading, assessment, and feedback generation. Web-CAT is implemented as a web application with a plug-in-style architecture so that it also can serve as a platform for providing additional student support services”. Free.
- https://github.com/autolab – “Web service for managing and auto-grading programming assignments”. Deals with course administration too – see https://autolab.github.io/docs/. Tango does the marking. “Tango runs jobs in VMs using a high level Virtual Machine Management System (VMMS) API. Tango currently has support for running jobs in Docker containers (recommended), Tashi VMs, or Amazon EC2.” But it seems to be more of a framework for automarking, each course-leader having to produce assessment code
- https://pypi.org/project/markingpy/ – same philosophy as my attempt, though more sophisticated. It can do timing tests. Alas the link to the docs is broken – https://markingpy.readthedocs.io.
- https://www.autogradr.com/ – Not free.
- https://github.com/GatorEducator/gatorgrader – same philosophy as my attempt. Has tests like “Does source code contain the designated number of language-specific comments?”, “Does source code contain a required fragment or match a specified regular expression?”, “Does a command execute and produce output containing a fragment or matching a regular expression?”
- https://github.com/apanangadan/autograde-github-classroom – “Currently, grade-assignments.py will get stuck if one of the student repos goes into an infinite loop.”
- https://github.com/cs50/check50 – using YAML files it’s easy to check if programs produce the right output for specified input. But I don’t think it can do other checks.
- https://classroom.github.com/assistant – “Track and manage assignments in your dashboard, grade work automatically, and help students when they get stuck”
- https://github.com/jupyter/nbgrader – an extension for notebooks that uses asserts. I think we use it.
- https://github.com/kevinwortman/nerfherder – “a collection of Python hacks for grading GitHub Classroom repositories.”
- https://github.com/Submitty/Submitty – “Customizable automated grading with immediate feedback to students. Advanced grading tools: static analysis, JUnit, code coverage, memory debuggers, etc.”
- https://www.codio.com/features/auto-grading – Not free.
- Teams for Education – A complete environment. Free options, but to get marking options it’s $35/team/month