The Robot Over My Shoulder

In a recent announcement at Microsoft Build, the CEO told the world what everyone knew but no one wanted to say out loud: AI requires your data. Windows is getting a new AI assistant feature to help you remember what you were up to and search through whatever you did on your computer; this includes games you played, who you were on a call with, or what you might be interested in doing next. Sounds fantastic! The digital assistant that Silicon Valley has been dreaming of since H. G. Wells is finally available to the masses. And the first question on everyone’s mind, according to numerous articles, blogs, and even my local morning radio DJs, was “Wait, how does it know all this about me?” followed quickly by “What the hell?” Hopefully I can help shed some light on what kind of hell this is turning out to be.

How Does AI Work?

I have no insider knowledge on how Microsoft is rolling out AI into Windows, only an understanding of the fundamentals behind the technology that these tools have been built on since the initial release of ChatGPT. In general, and at a very high level, the AI works like the auto-complete in your mobile keyboard of choice. The key difference is that these AIs can “know about” more than just text; they can “see” and “learn about” sound, images, or really any kind of information that can be stored in a structured way in a computer.

How a Robot Learns

At its core, all machine learning is an exercise in trying to predict what comes next. In your mobile keyboard’s auto-complete, this is accomplished by a very simple method of prediction called a Markov chain. Think of it as a big Excel sheet with words in column A, and then in each column after that, another word and the probability that it comes after the word in column A. Now have every row be a word in the English language.

AI training is the process where we figure out what those probabilities are. That would take a while to determine, right? You’d need to think about all the combinations that words would appear in. We could feed it all the books, newspapers, and webpages you could find. It’s daunting, but doable.
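To make the “figure out the probabilities” step concrete, here is a minimal sketch in Python. It assumes a tiny made-up corpus and a hypothetical function name, `train_bigram_model`; it only counts which word follows which and is nowhere near how a production model is trained.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count which word follows which, then turn the counts into probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair each word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1

    model = {}
    for current_word, followers in counts.items():
        total = sum(followers.values())
        # One "row" of the sheet: every follower and how likely it is.
        model[current_word] = {w: n / total for w, n in followers.items()}
    return model

# A toy, made-up corpus; a real one would be millions of documents.
corpus = [
    "I love you",
    "I love pizza",
    "I love you",
    "I want pizza",
]
model = train_bigram_model(corpus)
print(model["love"])
```

Each key in `model` plays the role of column A, and the dictionary stored under it plays the role of the columns to its right.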

Note

Key Terms

The books, newspapers, and webpages are the “Corpus,” or library of source material
The Excel sheet is an “AI model”
The formula is an “AI Assistant”

In the end, we will have our Excel sheet. Since we’ve calculated all these probabilities once, we can save the sheet and use it again and again without needing a rebuild. Now we can get to the predictions and figure out a formula for “given any word, give me a word that’s likely to come next.”
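Here is a rough sketch of “save the sheet, then use a formula on it.” The probabilities are made up, JSON is just one convenient file format I picked for illustration, and the function names (`save_model`, `load_model`, `most_likely_next`, `sample_next`) are hypothetical.

```python
import json
import os
import random
import tempfile

# A hand-written toy "sheet": word -> {next word: probability}. Numbers are made up.
model = {
    "i": {"love": 0.5, "want": 0.4, "think": 0.1},
    "love": {"you": 0.9, "pizza": 0.1},
}

def save_model(model, path):
    """Write the finished sheet to disk so it never has to be rebuilt."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)

def load_model(path):
    """Read a previously saved sheet back into memory."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def most_likely_next(model, word):
    """The 'formula': look up the row and return the highest-probability follower."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return max(followers, key=followers.get)

def sample_next(model, word, rng=random):
    """A variant that rolls weighted dice instead of always picking the top word."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return rng.choices(list(followers), weights=list(followers.values()), k=1)[0]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model.json")
    save_model(model, path)
    reloaded = load_model(path)

print(most_likely_next(reloaded, "I"))
```

The expensive part (building the sheet) happens once; the lookup is cheap and can run as often as you type.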

You may have noticed some major limitations with our Excel sheet example, though. We can only check what word might follow some other word. What about the case where you’ve been texting “I love” to someone and auto-complete helpfully suggests not only “you” but also “pizza,” “puppies,” or “your ex’s name”?

Simple Markov Chain

graph TD
    A[I] -->|0.5| B[love];
    A -->|0.4| C[want];
    A -->|0.1| D[think];
    B -->|0.9| E[you];
    B -->|0.1| F[your ex's name];

Context Matters

Clearly, “love” has different meanings based on context and we need more than just one word to figure out the next appropriate word. But now we’re going to need an Excel sheet for every word-A, a row for every word-B, and columns for all the possibilities. Our Excel file just got massive. What about three words? 20? Ok, we’ll need a different approach.
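A bit of back-of-the-napkin arithmetic shows how fast the sheet blows up. This sketch assumes a vocabulary of 50,000 words, which is a round number I picked for illustration and not a real count of English words, and it assumes every possible run of words gets its own row.

```python
def table_rows(vocabulary_size, context_words):
    """Rows needed if every possible run of `context_words` words gets its own row."""
    return vocabulary_size ** context_words

VOCABULARY_SIZE = 50_000  # assumed round number, not a real count of English words

for context_words in (1, 2, 3, 20):
    rows = table_rows(VOCABULARY_SIZE, context_words)
    print(f"{context_words:>2} words of context -> about {rows:.1e} rows")
```

Under that assumption, looking back just two words already needs 2.5 billion rows, and each of those rows still needs its own set of columns.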

In step the current AI models. At their core, all of them are prediction models based on a corpus of previously assembled data that can “look back” much, much further on more than just words. They are fancier and beautiful mathematically, but the models are big sets of probabilities that one thing might follow another.

All the investment and media coverage around how these AIs are trained is rooted in what the “things” are and what content was used to figure out those probabilities. The concerns around privacy are (rightly) rooted in whether the proprietary information you entrusted to a company is now a “probable next thing” in a stranger’s document. The fears about hallucinations and AI malpractice are the same as accidentally getting your ex’s name sent to your spouse.

Enter the Assistant

The assistant, or agent, or robot, or digital therapist is the Excel formula on steroids. It can not only take in what you’ve written so far in your text, but what time you’re writing it, where you are, what apps are open, who you’re talking to, where they are or might be, the weather, the news, your temperament, and so on. This one-time snapshot is the “one-shot” colloquially. Take all that context, feed it into your formula and it gives you a likely next word (or paragraph).

Hypothetical AI Assistant

flowchart TB
subgraph Sources

  
    A(📝 Current Text)
    B(🕗 Time)
    C(📍 Location)

    direction LR
    D(📱 Open Apps)
    E(👥 Contacts)
    F(⛅ Weather)

end
  
A & B & C & D & E & F --> G
G(🧑‍💼 AI Assistant)

G -->|Produces| H(💬 Next Word Prediction)
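In code, the diagram above might look something like the sketch below. Everything in it is hypothetical: the `Context` fields, the toy `MODEL`, and especially the rule that boosts “pizza” when a food app is open, which is only there to show context nudging the odds. Real assistants weigh context in far more sophisticated ways.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Context:
    """One-time snapshot of what the assistant can see. All fields are hypothetical."""
    current_text: str
    time: datetime
    location: str
    open_apps: tuple = ()
    contacts: tuple = ()
    weather: str = "unknown"

# Toy stand-in for the model (the "Excel sheet"). Numbers are made up.
MODEL = {"love": {"you": 0.9, "pizza": 0.1}}

def assistant(model, context):
    """Score candidates for the next word from the snapshot; nothing is stored."""
    words = context.current_text.lower().split()
    if not words:
        return None
    # Copy the row so the model itself is never modified.
    scores = dict(model.get(words[-1], {}))
    if not scores:
        return None
    # Made-up rule: a food delivery app being open makes "pizza" more likely.
    if "food delivery" in context.open_apps and "pizza" in scores:
        scores["pizza"] *= 20
    return max(scores, key=scores.get)

now = datetime(2024, 5, 21, 19, 30)
at_home = Context("I love", now, "home")
hungry = Context("I love", now, "home", open_apps=("food delivery",))

print(assistant(MODEL, at_home))
print(assistant(MODEL, hungry))
```

The same text gets a different suggestion depending on the snapshot, and `assistant` is an ordinary function that hands back an answer and keeps nothing.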

Note that we never saved that context. The model, or “Excel sheet,” didn’t change. Nothing has to be saved to get the result we are after.

But we could.

Throwing a Robot out with the Bath Water?

Hopefully, I’ve helped provide a solid basis for how these AI tools work in general. The devil, as they say, is in the details. And with so much data, it’s details all the way down.

Sourcing Locally is Usually Fine

As with farmers markets, the closer you get to the source, the more you know about where your data comes from and how it’s been processed. When a company like Microsoft says that “the model is on your computer,” what they mean is the Excel sheet is saved on your hard drive somewhere. When they say, “it will know about what you are doing and what you’ve been up to,” they are saying “we’re going to use things like system logs, documents, and files that already exist, plus what’s on your screen as inputs to the formula.”

As stated, that’s usually fine. But, what if, say, you wanted to keep the processing of all those probabilities, based on everything your computer can offer, for later? We’d probably not like our laptop battery draining every 10 minutes as it crunches numbers. That’s where things get tricky.

What is Saved and How

To recap, we have the base model, provided to us by a vendor like Microsoft, sitting on our hard drive alongside all our files. We’d like to personalize it with the content that is available on our computer. So that’s what we do. We create a new “baby” model based only on the content on the machine. Then we update our assistant formula to use both the base model and our baby model when making personalized predictions.
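One simple way to use “both models” is to blend their probabilities. The sketch below assumes a plain weighted average, with a made-up knob called `baby_weight` and made-up numbers; it is my illustration of the idea, not a description of how Microsoft or anyone else actually combines models.

```python
def blend(base_followers, baby_followers, baby_weight=0.3):
    """Mix two next-word probability tables; baby_weight is a made-up tuning knob."""
    words = set(base_followers) | set(baby_followers)
    return {
        w: (1 - baby_weight) * base_followers.get(w, 0.0)
        + baby_weight * baby_followers.get(w, 0.0)
        for w in words
    }

# Toy numbers: the vendor's model knows the world, the baby model knows us.
base_model = {"love": {"you": 0.9, "pizza": 0.1}}
baby_model = {"love": {"pizza": 0.8, "puppies": 0.2}}

def personalized_next(word, baby_weight=0.3):
    """Pick the likeliest next word after blending the base and baby models."""
    mixed = blend(base_model.get(word, {}), baby_model.get(word, {}), baby_weight)
    return max(mixed, key=mixed.get) if mixed else None

print(personalized_next("love", baby_weight=0.3))
print(personalized_next("love", baby_weight=0.9))
```

The more weight the baby model gets, the more the suggestions sound like the contents of your own machine, which is exactly why the next point matters.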

But that baby model holds a probability record of everything on our computer. It has to be secure.

Thankfully, most modern devices (and specifically ones that run Windows 11) have what’s called a TPM (Trusted Platform Module), a hardware-backed component that protects the cryptographic keys used to encrypt and decrypt files like our model. Those keys are unique to our device and should keep the baby model secure from all but the most determined state-sponsored hacker.

An "ideal" AI setup

%%{ init: { 'flowchart': { 'curve': 'basis', 'htmlLabels': false } } }%%
flowchart TB
    M -->|"`Searches`"|V
    V -.->|"`_Maybe also_ Provides`"| C & TPM

   
    subgraph C["`**Our Computer**`"];
        AI -->|"Uses"|M & B
        subgraph E["🔒 Encrypted"];
            B -->|"`Searches`"|D
            D -.->|"`'Learning'`"|B
            B("`🤖 'Baby' (local) Model`")
            D[("`💾 Our Personal Data`")]
        end
        TPM("🔐 TPM") -->|"Encrypts"|E
    end
    Me("🤓 Me") -->|"Asks"|AI
    V("🏢 AI Provider") -->|"Provides"| AI("🧑‍💼 AI Assistant") & M("🤖 Vendor Model")

But Microsoft

Here’s where the public writ large seems to bristle at the Build AI announcement, even if they aren’t sure exactly why. If I may be so bold as to speak for the layman, the issue lies with all the “yeah, but”s:

“Yeah, but companies like Microsoft and Apple also make the devices these things run on, so how could we trust they don’t have access to this TPM?”

You must trust them based on their published documentation and marketing.

“Yeah, but can’t they just send the saved file back to headquarters?”

They absolutely can and do. It’s usually covered by the privacy policy or user agreement, which is miles long and written (deliberately in too many cases) in legalese like a contract (which it is).

“Yeah, but won’t they train the models they send out with my model?”

They aren’t supposed to unless you’ve agreed to it.

“Yeah, but how did they train it to begin with?”

Using scraped internet websites (under litigation), bulk-data sales (now illegal in California), and metrics and content from their free tiers of services (usually allowed in lieu of cash for continued use). The list goes on (and can you see a pattern?).

“Yeah, but how are they going to keep this base model updated?”

An excellent question. Does it need to be updated? If so, what’s going to feed it in the future if we, the consumers, start exercising our rights to restrict that use? Won’t the models stagnate? It remains to be seen.

Conclusions

Consumers

Question the use of your data. Data about you, your family and loved ones is being used by these companies offering services. Some of them are taking your concerns seriously with dedicated people keeping these risks at heart while still offering great AI services, but not all of them. Maybe not even a majority of them.

Founders

Security and privacy are a strength in selling. If early morning DJs in Indiana are talking about it, your customers are thinking about it. It is a boon to have an upfront and honest commitment to your ultimate source of income about how you use the data they entrust to you.

Product and Security Leaders

Know what you are requiring of the users to make your amazing features work and why. Let your users know about it in a proactive manner, not hiding it behind a labyrinth of micro-service options that change every full moon.

As I have advised since GPT mania in 2022:

You wouldn’t let your keyboard auto-complete do anything on your behalf without your oversight. Don’t let “AI.”

Editor’s note: Revised to add visualisations for clarity.