When AI Can Use a Computer, but Still Misses the Important Window (A Backup Automation Experiment)

How I Used ChatGPT and Codex to Build a Monitored Backup Workflow — and What That Taught Me About Automation

 

What started as a simple backup task turned into something much more interesting.

Yes, the technical goal mattered: I wanted my always-on Windows system Aether to run a daily backup of my local Google Drive mirror to Athena, my always-on server, and I wanted LibreNMS to send me an email if that backup failed. The final design ended up being a clean daily 22:00 FreeFileSync Update job, wrapped in PowerShell, logged into the Windows Application log, forwarded through NXLog, and monitored by LibreNMS with a critical alert rule called Aether Backup Failed. But the real story was not just the backup. 

The real story was the experiment: what happened when I tried to build that workflow with AI automation!

This project became a very practical lesson in what AI is already good at, what it is bad at, and what “computer use” actually means once you leave neat terminal commands and start dealing with installers, hidden windows, vendor-specific GUI settings, and software that behaves differently depending on whether a human is there to click something.

In other words: this was less a backup post, and more a field report on AI-assisted administration.


Project Overview

The final target was straightforward.

Aether, my Windows mini PC, would back up:

C:\Users\Admin\Meine Ablage

to this UNC path on Athena:

\\192.168.10.2\exthdd\GDrive_Backup

Google Drive was already part of the bigger strategy before this project started. The important data was not living on only one machine. Google Drive sync was installed on several hosts, so the working data already existed in multiple synced locations, including Aether. But sync is not the same thing as backup. If Google Drive ever propagates an accidental deletion or loses data, that problem can spread through the synced copies as well. That is why this project added one more layer: an extra FreeFileSync Update backup outside of Google Drive. The point of that extra copy was exactly that it should not behave like sync. It should copy new and changed files to Athena, but it should never mirror deletions back to the target just because Google Drive deleted something by mistake. FFS is an old friend from university (the app, yes); it is open source and still gets updated, so I don't need Veeam or anything like that. It's really more of an on-top backup strategy I had wanted to apply for a while. :)

The sync mode would be Update, not Mirror. That meant one-way copy/update behavior, no deletion on the target, and no versioning. The job would run once per day at 22:00 through Windows Task Scheduler. A PowerShell wrapper would start FreeFileSync, wait for the exit code, and then write one of two fixed Windows Application log messages:

DPROJECT_BACKUP: SUCCESS

or

DPROJECT_BACKUP: FAILED

NXLog, which was already running on the Windows box, would read those events and forward them to LibreNMS as syslog. LibreNMS would then look for the failure string and send me an email alert.

So the finished chain looked like this:

Aether
→ Task Scheduler
→ PowerShell wrapper
→ FreeFileSync batch job
→ Windows Event Log
→ NXLog
→ LibreNMS syslog
→ email alert

That is the neat version.

The actual journey was messier, and much more interesting. 

Final architecture diagram: Google Drive sync on multiple hosts, plus one independent Update-style backup outside Google Drive, with LibreNMS alerting on failure. It's sloppy, but an image in between all the text is easy on the eyes:


The Team: Daniel, ChatGPT, and Codex

Like in my earlier homelab post, this was not “AI does everything while I watch.” It was a three-part team.

I had the real machine, the real GUI, the real network share, and final responsibility for what was allowed to happen. I also had the final say on the backup behavior, the safety constraints, the screenshots, and what counted as real proof.

ChatGPT acted as the planning and management layer. That meant helping define the architecture, challenging weak assumptions, translating vague goals into a structured workflow, and making sure the design stayed understandable instead of becoming a pile of improvised fixes.

Codex acted as the execution layer. Once a task was sufficiently clear, Codex could create folders, prepare scripts, attempt installs, document findings, verify paths, and produce technical notes much faster than I would normally write them by hand. In this project it built the scaffolding in C:\Codex, wrote the wrapper script, documented the setup, and helped drive the whole process forward. I ended up having to do more QA on this one though.

That division has worked well so far. I also wanted to test the new computer-use abilities of Codex with the new Codex Windows app. It exposed something important: AI can move quickly through structured work, but it still depends on a human when the truth is stuck inside a GUI popup, a hidden dialog, or a piece of software behaving in a way the AI cannot really observe.

That happened almost immediately.


Step 0: The First Reality Check — “Silent Install” Was Not Silent

At the start, the idea sounded simple enough: let Codex install FreeFileSync and move on.

Codex tried to automate the install. The problem was that the install did not actually behave like a clean unattended process. Based on what I saw on the Windows side, it looked like the GUI was waiting for something interactive, and it may even have been complaining that the silent-install style behavior required the donation edition of the app. I am deliberately phrasing that carefully, because the key point is not the exact licensing message. The key point is what happened operationally: Codex was waiting for the install to complete, while the actual GUI seemed to be sitting there with a state or popup that Codex was not truly seeing.

That is one of the best examples from the whole project.

People hear “AI can use a computer” and imagine something like full visual awareness plus perfect control. In practice, what you often get is partial computer use. The AI can launch the installer. It can wait on the process. It can reason about likely flags. It can retry. But if the important truth is in a small window, a stalled installer state, or a dialog the automation layer is not really perceiving correctly, then you get a mismatch between what the AI thinks is happening and what the human sees on screen.

That is exactly what happened here.

Eventually the installation did complete correctly after retries and more explicit handling, and the installed executables were verified here:

C:\Program Files\FreeFileSync\FreeFileSync.exe
C:\Program Files\FreeFileSync\Bin\FreeFileSync_x64.exe

So yes, the AI helped push the install forward. But the deeper lesson was this: GUI state is still reality, and AI often does not have the same grasp of it that a human at the keyboard has.

Codex ended up waiting for the installer to finish, while I saw the popup and finished the install myself.


Step 1: Defining the Safe Backup Design Before Touching the Job

 

Before building the actual backup task, we first nailed down the safety rules.

This was not supposed to be a mirror that could wipe out the destination. It was supposed to be a non-destructive one-way backup. That meant:

  • source on the left

  • target on the right

  • Update mode

  • no target deletions

  • no versioning

The other important design choice was the path.

Early on, a mapped Z: drive was part of the discussion. That would have been convenient for manual use, but not ideal for a scheduled unattended job. Mapped drives can be session-specific, background tasks may not inherit them, and reconnect timing can get messy after boot or wake. So the final design rejected the drive letter and used the UNC path directly:

\\192.168.10.2\exthdd\GDrive_Backup

That sounds like a small detail, but it really was not. It is exactly the kind of thing that separates “works when I click it manually” from “works reliably at 22:00 with nobody watching.”
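One nice side effect of the UNC decision is that a wrapper can sanity-check the target before launching anything. The following is an illustrative preflight sketch, not part of the final Run-GDriveBackup.ps1; it reuses the event source and failure string described later in this post:

```powershell
# Hypothetical preflight check: fail fast (and emit the FAILED event)
# if the UNC target is unreachable, instead of letting FreeFileSync
# discover the problem mid-run.
$Target = '\\192.168.10.2\exthdd\GDrive_Backup'

if (-not (Test-Path -LiteralPath $Target)) {
    Write-EventLog -LogName Application -Source 'BackupScript' `
        -EntryType Error -EventId 101 -Message 'DPROJECT_BACKUP: FAILED'
    exit 1
}
```

A check like this keeps the failure signal consistent: whether the share is down or the sync itself fails, monitoring sees the same fixed message.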

This was also a good example of the team model working well. ChatGPT kept pushing the design toward unattended reliability. Codex documented the risk and the setup plan. I confirmed what the real environment looked like and which path was actually correct.


Step 2: Codex Built the Framework, but the Batch File Was Wrong

Once FreeFileSync was installed, Codex created the project structure inside C:\Codex and generated an initial project-side .ffs_batch file plus the wrapper script.

On paper, that was exactly what should have happened.

In reality, the first generated FreeFileSync batch file was not valid for the installed version. FreeFileSync reported that the file was incomplete and that missing XML elements were being reset to defaults. That is a subtle but important kind of failure. The file existed. It looked plausible. An AI could easily believe it had produced the required artifact. But the application itself was telling us that the configuration was incomplete and therefore not safe to trust.

This is another place where AI can sound more in control than it really is.

A lot of vendor-specific export formats are not just “XML that looks right.” They are “XML that this exact version of this exact application accepts as complete.” And those are not always the same thing.

So at that point I stepped in and created a working FreeFileSync batch file manually at:

C:\GDrive_to_Athena_Update.ffs_batch

That manual file became the source of truth. Later, the project copy was aligned with it, and the PowerShell wrapper was pointed directly to the valid working file to remove ambiguity.

That was not AI failure in the dramatic sense. It was something more normal and more instructive: AI got us close, but the application itself had the final word.

This is the final batch XML that I created via the GUI:

 

The weird part was that Codex said its file and mine were perfectly equal.
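For readers without the screenshot: I won't reproduce the exported file here, but structurally a FreeFileSync batch file for this job looks roughly like the sketch below. This is from memory of the format; element names and required fields vary by FreeFileSync version, which is exactly why a plausible-looking generated file can still be rejected as incomplete:

```xml
<?xml version="1.0" encoding="utf-8"?>
<FreeFileSync XmlType="BATCH">
    <Synchronize>
        <Variant>Update</Variant>
    </Synchronize>
    <FolderPairs>
        <Pair>
            <Left>C:\Users\Admin\Meine Ablage</Left>
            <Right>\\192.168.10.2\exthdd\GDrive_Backup</Right>
        </Pair>
    </FolderPairs>
    <Batch>
        <!-- unattended-safe behavior: abort on error instead of
             showing a dialog and waiting for a human -->
        <ErrorDialog>Cancel</ErrorDialog>
    </Batch>
</FreeFileSync>
```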


Step 3: Turning a Backup Job into a Monitoring Signal

This is where the project stopped being “run a backup” and started becoming actual automation engineering.

A backup only becomes operationally useful once it produces a signal that other systems can understand.

For this project, the translation layer was a PowerShell wrapper:

C:\Codex\scripts\Run-GDriveBackup.ps1

Its job was simple.

First, launch the real FreeFileSync batch job.
Second, wait for it to finish.
Third, inspect the exit code.
Fourth, write a fixed Windows Application log event based on that outcome.

The wrapper used these two fixed strings:

DPROJECT_BACKUP: SUCCESS
DPROJECT_BACKUP: FAILED

That was intentionally boring.

And that is exactly why it was good.

Human-friendly status messages are nice for reading, but they are worse for matching. Monitoring works better when the signal is stable and machine-friendly. So the wrapper deliberately reduced all the complexity of the backup outcome into two fixed messages that other tools could reliably detect.

The relevant logic looked like this:

# Helper: write a fixed event to the Windows Application log
# (source: BackupScript, as described in Step 6)
function Write-BackupEvent {
    param([string]$Type, [int]$Id, [string]$Message)
    Write-EventLog -LogName Application -Source 'BackupScript' `
        -EntryType $Type -EventId $Id -Message $Message
}

# Launch the real FreeFileSync batch job and wait for its exit code
$process = Start-Process -FilePath $FreeFileSyncExe -ArgumentList @($BatchFilePath) -Wait -PassThru
$exitCode = $process.ExitCode

if ($exitCode -eq 0) {
    Write-BackupEvent -Type 'INFORMATION' -Id 100 -Message 'DPROJECT_BACKUP: SUCCESS'
    exit 0
}

Write-BackupEvent -Type 'ERROR' -Id 101 -Message 'DPROJECT_BACKUP: FAILED'
exit $exitCode

That is the kind of code I like in infrastructure work. Small. Legible. Purpose-built. No drama.


Step 4: Another Important Limitation — A Popup Does Not Equal an Alert

This part turned out to be more subtle than expected.

At first glance, you might think that if FreeFileSync shows an error popup, then the monitoring system should obviously send a failure email.

But that is not actually how the chain works.

LibreNMS does not know about the popup. It only knows what final message reaches it. And that message is determined by the final FreeFileSync exit code seen by the wrapper. So if FreeFileSync pauses on a popup and a human later clicks Ignore, the sync can still continue and may still return success. If that happens, the wrapper will log DPROJECT_BACKUP: SUCCESS, not FAILED, and LibreNMS will stay quiet.

That led to some of the most important corrections in the whole project.

The FreeFileSync batch job had to be saved with unattended-safe GUI behavior. In the batch settings, the risky behavior was essentially “show error message and wait.” The correct unattended behavior was “cancel on error.” I saw a message; Codex did not. I clicked OK and the sync went through, but Codex just saw some sort of non-zero error output.

It was easy to fix, but it was a great lesson in the limits of AI, too.

AI can help write the script and explain the logic, but it is very easy to miss some small setting that changes the meaning of the entire system. Sometimes Codex misses it, and sometimes I miss it because I am a n00b.

Where and how to save a batch config: 


Step 5: Scheduling It for Real

Once the wrapper and the working batch file were in place, the scheduled task became simple.

Task name:

GDriveToAthenaUpdate

Schedule:

daily at 22:00

Command:

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -ExecutionPolicy Bypass -File C:\Codex\scripts\Run-GDriveBackup.ps1

Run account:

Admin

That is a good example of the final architecture becoming calmer than the setup process. By the end, the scheduled task itself was not complicated at all. The complexity was in making sure the thing it launched was trustworthy.
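For reproducibility, the same task could also be created from an elevated prompt with something like the sketch below. This is a hedged reconstruction from the task details above, not the exact command used during setup:

```bat
schtasks /Create /TN GDriveToAthenaUpdate /SC DAILY /ST 22:00 ^
  /RU Admin /RP * ^
  /TR "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -ExecutionPolicy Bypass -File C:\Codex\scripts\Run-GDriveBackup.ps1"
```

The /RP * flag prompts for the run account's password at creation time, which keeps it out of shell history.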

This is another place where AI was useful, but not magical. Codex could define the task and document it. ChatGPT could keep the logic coherent. But only after the earlier mistakes were corrected did this become something I would actually trust to run at night without supervision. Codex said it copied my config into its working directory, but then it did not work. You can set a root/working directory for Codex so that it does not operate outside those boundaries. In the end, however, I pointed the task at my file in my location.


Step 6: Verifying the Signal Locally on Windows

Once the scheduled execution path existed, the next question was obvious:

Can I see the signal where it starts?

The wrapper writes into the Windows Application log using source BackupScript. Success is written as an informational event with ID 100. Failure is written as an error event with ID 101.

A quick manual test of the failure side looked like this:

eventcreate /T ERROR /ID 101 /L APPLICATION /SO BackupScript /D "DPROJECT_BACKUP: FAILED"

That gave me a known-good failure message to trace through the rest of the stack. This step matters because it isolates the monitoring path from the backup logic. If the test event arrives in LibreNMS, then the forwarding and alerting chain works even before a real backup failure happens.
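The success side can be exercised the same way, mirroring the wrapper's informational event. This is the obvious counterpart to the failure test above; useful mainly to confirm that success events do not accidentally trigger the alert rule:

```bat
eventcreate /T INFORMATION /ID 100 /L APPLICATION /SO BackupScript /D "DPROJECT_BACKUP: SUCCESS"
```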

That kind of staged validation is something I want more of in homelab work, not less. It is easy to get excited about “the whole thing works.” It is better to prove each boundary separately.

[ Screenshot idea: Windows Event Viewer showing DPROJECT_BACKUP: FAILED ]


Step 7: NXLog Forwarded the Windows Event Into the Monitoring Stack

The next hop was already partly in place because this PC was already participating in centralized logging.

NXLog was installed and running on Aether. Its job here was to read the Windows Event Log using im_msvistalog and forward those entries via UDP syslog to LibreNMS at 192.168.10.5:514. The relevant config file was:

C:\Program Files\nxlog\conf\nxlog.d\10-win-to-syslog.conf

That made the Windows Application log a kind of bridge format.

The wrapper did not need to speak syslog directly. It only needed to create clean Windows events. NXLog handled the translation into syslog, and LibreNMS handled the storage and alerting on the other side.
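Conceptually, that config has the shape sketched below. This is a simplified illustration of a Windows-event-to-syslog NXLog pipeline, not the verbatim file on Aether:

```
# Simplified sketch of a win-to-syslog NXLog config
<Extension syslog>
    Module  xm_syslog
</Extension>

<Input win_events>
    # Read the Windows Event Log (including the Application log
    # where the wrapper writes its fixed messages)
    Module  im_msvistalog
</Input>

<Output librenms>
    # Forward as BSD syslog over UDP to LibreNMS
    Module  om_udp
    Host    192.168.10.5
    Port    514
    Exec    to_syslog_bsd();
</Output>

<Route win_to_librenms>
    Path    win_events => librenms
</Route>
```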

That is another example of good automation design: each layer does one job well.


Step 8: LibreNMS Became the Decision Point

Once the event reached LibreNMS, the final step was to make the monitoring system care about it.

The syslog message arrived from Aether and could be found in LibreNMS. The core matching string was simple:

msg like "%DPROJECT_BACKUP: FAILED%"

For the real rule, the logic was tightened so the alert was scoped to Aether rather than any device that might someday emit the same text. The final alert was:

Aether Backup Failed

Severity:

critical

Transport:

Mail Daniel Gmail
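In rule-builder terms, the tightened condition was conceptually something like the expression below. This is a pseudo-form; the exact field names depend on the LibreNMS rule builder, and in practice the device scoping can also be done via the rule's device mapping instead of the expression itself:

```
syslog.msg LIKE "%DPROJECT_BACKUP: FAILED%"
AND devices.hostname = "Aether"
```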

That was the point where the whole chain became real.

Not “configured.”
Not “apparently working.”
Real.

A controlled failure string could be generated on Windows, forwarded through NXLog, stored by LibreNMS, matched by the alert rule, and turned into an email.

That is exactly the kind of end-to-end evidence I want from an automation project.


What AI Was Good At in This Project

The strongest thing AI did here was not “click buttons on a computer.”

The strongest thing AI did was structure.

ChatGPT was useful for turning a vague requirement into a proper architecture: use Update mode, avoid deletions, reject the mapped drive, put a wrapper in front of the batch file, reduce outcomes to two fixed messages, use the Windows Event Log as the bridge, and let LibreNMS do the final alerting.

Codex was useful for execution once the path was clear: creating the project structure, preparing scripts, documenting the setup, keeping files organized, and pushing the task forward faster than I would have done by hand in one sitting.

That is real productivity.

And it is the kind of productivity I actually care about: not replacing understanding, but reducing friction around structured work.


What AI Was Bad At in This Project

The weak points were just as informative.

First, GUI awareness was limited. The FreeFileSync install problem showed that clearly. Codex could initiate the install and reason about it, but it did not really “understand” the blocked GUI state the way a human looking at the screen could. The installer was waiting in some important way, and the AI side was effectively blind to the most important part of the situation.

Second, generated config is not always trustworthy just because it is syntactically plausible or Codex says it is. The first .ffs_batch file looked like progress until FreeFileSync itself declared it incomplete and reset parts of it. That meant the real authority was still the application, not the AI-generated file.

Third, AI can miss the one GUI toggle that changes the operational meaning of the whole system. In this case, the difference between “show error dialog” and “cancel on error” determined whether the backup workflow would fail cleanly and emit the right alert signal, or hang in an awkward half-human, half-automation state.

That is the practical limit of AI computer use right now. It is not useless. Far from it. But it is still much stronger at planning around a computer than it is at fully perceiving everything a real interactive desktop is doing.

All in all, Codex is really good at Linux- and PowerShell-related work. But the actual installation of GUI apps, plus configuration through the GUI, is not there yet.


Final Verification

By the end of the project, the following things were true.

The source path existed.
The UNC target existed.
FreeFileSync was installed.
The scheduled task existed and ran the wrapper.
The wrapper wrote fixed Windows Application events.
NXLog forwarded those events to LibreNMS.
LibreNMS stored the messages.
The Aether Backup Failed rule existed.
The email transport was attached.
And I confirmed that the email alert actually arrived.

That is the level of closure I want from homelab automation.

Not just “script written.”
Not just “looks good.”
But: signal created, signal forwarded, signal matched, alert delivered.


The Real Takeaway

This project was nominally about backup automation.

But what it really taught me was this:

AI is already useful in real administration work, especially when the job involves planning, decomposing problems, writing wrappers, generating documentation, and keeping multiple layers of a workflow connected. It can reduce a lot of friction. It can turn loose ideas into concrete systems surprisingly fast.

What it still does not do well is replace human reality-testing.

When the truth is in a GUI popup, a hidden installer state, a vendor-specific export format, a session-bound mapped drive, or one tiny application setting that decides whether the whole chain succeeds or fails properly, the human is still the one who closes the loop.

And that is not a disappointing conclusion.

It is a useful one.

Because it means the right mental model is not “AI does the project for me.”

The right model is more like this:

I run the project.
ChatGPT helps me think, plan, write the blog, generate images, and so on.
Codex helps me execute: install, configure, and produce technical documentation and deliverables.
The machine gives the final answer.

And my job is still to notice when the machine is telling a different story than the AI.