Motiejus Jakštys Public Record

This conversation totally didn't happen at Microsoft

2023-12-07

Similarity to real-world events and character names is coincidental.

Characters, Microsoft employees:

  • Amy: a high-level executive. Ex-JPMorgan. Pragmatic.
  • Harry: an engineer on the Developer Services team. His organization owns code hosting, developer tools and CI infrastructure. A good listener.

2015 — the beginning of Git at Microsoft

Exchange between Harry and Amy in a parking lot on a chilly Redmond morning:

  • Harry: Amy, our Skype colleagues from Tallinn have been using git since 2006 and are making fun of us for using perforce in 2015. Our ex-AWS colleagues take offense, since they know Estonians are right. In fact, everyone takes offense, because nobody likes to admit Estonians are right. Git is a tad too slow for large repos, preventing quick migration. Do you mind if I ask my team to take a look into this?
  • Amy: sure, go ahead, Harry. I don’t care about version control, do what you think is right as long as it works for everyone.

Harry starts poking at git to make it work better for larger repositories.

late 2016 — money pressure and GVFS

Harry and his team implement partial clone (later renamed to sparse checkout). With careful hand-holding, crossed fingers and good weather, Visual Studio can now load the partially-cloned Windows repository without crashing. Excitement grows. Friendly, congratulatory exchanges between Estonians and Redmondians take place. Engineers get excited thinking the migration is “soon”.

Harry and Amy again:

  • Amy: Harry, how’s that git thing going? I said I don’t care about version control, but for some reason I do now.
  • Harry: pretty well, why?
  • Amy: just curious, what would it take to migrate the whole company to git?
  • Harry: the tooling is robust and we are ready to migrate. One last thing: Windows and Office repositories are in the hundreds of gigabytes. About 50k people will need to get their laptops’ disks replaced. Oh, and we will kill the office network while they download the initial clone. With good planning, we should be good in a month or two.
  • Amy: sounds like $20 million for the disks and lost productivity while this chaos settles down. Any other ideas?
  • Harry: our central repositories are in the basement, and the office connectivity is quite good. Maybe we can use shallow clones.
  • Amy: whatever that means. If it helps, try to make it happen.

Harry scrambles to do something about it and creates GVFS. Open sources it. Everyone understands it’s a temporary solution, so they live with it. People use their git.

2017 — migration is over and problems with GVFS

Migration is over for the last repository. People are complaining about GVFS, but at least they are on git. Amy did not spend her political capital on procurement, so she is happy.

GVFS is open-source, but only sort-of. It bakes in many Microsoft assumptions (e.g. don’t even try macOS), but companies cargo-cult GVFS and struggle with it anyway, because it’s Microsoft.

2018 and later: github acquisition and Scalar

Microsoft buys github. Estonians no longer have anything to make fun of, so they fall back to poking the flies on their office windows. Harry has an eye on replacing GVFS.

Harry’s team keeps improving git. Rewrites GVFS in C and renames it to scalar. To take revenge on the Estonians, Harry’s colleague Theodoric bets that he can put microsoft-specific code into upstream git. He wins:

https://github.com/git/git/blob/v2.35.0/contrib/scalar/scalar.c#L144

Late 2023

MS taught their developers to use scalar. Dozens of other companies who believe their repositories are big clone Microsoft’s workflow. However, their git repositories are not in the basement of their office, so many people unknowingly pay the price of calling into github every few seconds.

The speed of light did not change over the last decade. If your git repository is on another continent, it will still take at least 100ms for the round-trip (plus whatever outage your git provider has this minute). SSDs cost ~$100/TB, and that keeps decreasing.

scalar.c has been “made official” and moved from contrib to top-level. But the azure ghosts are still with us:

https://github.com/git/git/blob/v2.43.0/scalar.c#L145

Takeaways

Try this if you think your repo is big:

git clone -c feature.manyFiles=true git@<...>

And forget shallow clones. Sparse checkouts are pretty decently done, so if your repository allows that, it may be a good thing to try.
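To get a feel for sparse checkouts without touching a real project, here is a minimal sketch on a throwaway repository. The directory names (services/api, docs) and the demo identity are made up for illustration; swap in your own remote and paths. It assumes a git new enough to have `git clone --sparse` and `git sparse-checkout`.

```shell
set -eu
work=$(mktemp -d)

# Build a throwaway "upstream" repository with two top-level directories.
git init -q "$work/upstream"
mkdir -p "$work/upstream/services/api" "$work/upstream/docs"
echo 'api' > "$work/upstream/services/api/main.txt"
echo 'docs' > "$work/upstream/docs/index.txt"
git -C "$work/upstream" add .
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'initial'

# Clone with a sparse working tree: subdirectories are not checked out yet.
git clone -q --sparse "$work/upstream" "$work/clone"

# Ask only for the directory you actually work on.
git -C "$work/clone" sparse-checkout set services/api

# Show what ended up in the working tree.
ls "$work/clone"
```

The full history is still downloaded; only the working tree shrinks. That is the point: the data sits on your disk, so day-to-day commands do not need a round-trip to the server.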

Also have a look at git maintenance and git config core.fsmonitor.
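A minimal sketch of what trying these two looks like, again on a throwaway repository. Whether `core.fsmonitor=true` actually starts the built-in file system monitor depends on your git version and platform, so treat that line as something to verify against your own build rather than a guaranteed speed-up.

```shell
set -eu
repo=$(mktemp -d)
git init -q "$repo"

# Run one maintenance task by hand. `git maintenance start` would instead
# schedule background runs for the repository, which this sketch avoids.
git -C "$repo" maintenance run --task=gc

# Turn on the file system monitor setting for this repository only.
git -C "$repo" config core.fsmonitor true

# Print the value back to confirm it was stored.
git -C "$repo" config --get core.fsmonitor
```

In a real repository you would more likely run `git maintenance start` once and let it work in the background.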

If you eye a large company for a solution, think about their context. Your repository probably doesn’t weigh hundreds of gigabytes, and it will not cost $20 million to procure larger disks for developers.