Interesting, if true. A naive application of the described approach (assuming no rolling-shutter trickery) would sample one point on the edge of the visual reactor and interpret the deviation of its position in each frame as a scalar amplitude. Under those circumstances the Nyquist theorem applies, and the highest frequency that could be captured faithfully is half the framerate. (So to reconstruct human speech with content up to 300 Hz, you'd want an effective rate of at least 600 samples/s, i.e. 300 fps plus the doubling trick below, or 600 fps outright.)

Doubling that would require getting more data out of each frame, which seems easy under just the right circumstances but nigh impossible otherwise. One approach would be to sample two visual reactors, yielding two samples per frame whose effective times differ by the time sound takes to cover the difference in their distances from the source. That would be easy to do, but you'd need the reactors at the right relative distances: a path difference of about 57 cm (taking the speed of sound as roughly 343 m/s) turns a 300 fps framerate into a uniform 600 samples/s. Any other path difference (modulo ~114 cm, the distance sound travels in one frame period at 300 fps) would yield lower-quality results, with the gap between consecutive samples alternating between two different values. You'd also want to normalize the two sample streams to the same volume, to avoid artifacts at the frequency of their offset.
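To make the geometry concrete, here's a small sketch of the two-reactor idea. It computes the effective source-times of the samples you'd get from two reactors whose distances from the sound source differ by a given amount: the farther reactor "hears" older audio, so its per-frame sample is effectively delayed. All the constants (framerate, speed of sound) are my own illustrative assumptions, not from the original comment.

```python
FPS = 300.0            # assumed camera framerate
C = 343.0              # assumed speed of sound in air, m/s (~20 C)
FRAME_PERIOD = 1.0 / FPS

def effective_times(distance_diff_m, n_frames=5):
    """Effective source-times of the samples from two visual reactors
    whose distances to the sound source differ by distance_diff_m."""
    delay = distance_diff_m / C        # acoustic lag of the farther reactor
    times = []
    for k in range(n_frames):
        t = k * FRAME_PERIOD
        times.append(t)                # nearer reactor: audio from frame time
        times.append(t - delay)        # farther reactor: older audio
    return sorted(times)

def gaps(times):
    """Spacing between consecutive effective sample times."""
    return [b - a for a, b in zip(times, times[1:])]

# Path difference of half a frame period's travel (~57 cm at 343 m/s):
# the interleaved stream is uniformly spaced at 1/600 s, i.e. 600 samples/s.
ideal = C * FRAME_PERIOD / 2
print(gaps(effective_times(ideal)))

# Any other path difference: gaps alternate between two values,
# giving the lower-quality, non-uniform sampling described above.
print(gaps(effective_times(0.30)))
```

Reconstructing audio from the non-uniform case is still possible (it's just irregular sampling), but the alternating gaps are why you'd prefer the half-period spacing.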