Previously the Metal backend used a manual mutex system to make sure the
BufferSetSubData didn't have data races with reads from the GPU. Replace
this with a non-hacky version
- Make the Buffer objects allocated on the GPU
- Make SetSubData use a ResourceUploader that allocates a CPU buffer
and schedules a CPU->GPU copy.
- Have a list of pending commands and a finished command serial to
order operations and track when resource become unused.
No functional changes intended, but there are a couple additional
cleanups:
- Use anonymous namespaces instead of static functions
- Don't store an extra Device pointer in objects