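The RTP timestamp generally advances by the number of media samples (clock ticks) covered by each packet, so the increment depends on the media unit: the frame period for video (fps) or the packet duration for audio (ms).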
For MPEG video, the increment per frame is generally calculated as
clock rate / fps
That is, if the video clock rate is 90000 Hz and the stream runs at 30 fps:
90000 / 30 = 3000 ticks per frame
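As a rough sketch of the video calculation in Python (the function name `video_ts_increment` is made up for illustration, and exact fractions are used only so that non-integer rates such as 30000/1001 fps do not pick up rounding error):

```python
from fractions import Fraction

def video_ts_increment(clock_rate, fps):
    """RTP timestamp ticks per video frame: clock rate / fps."""
    return Fraction(clock_rate) / Fraction(fps)

# 90000 Hz clock at 30 fps
print(video_ts_increment(90000, 30))                     # 3000
# Same clock at 30000/1001 fps (NTSC "29.97")
print(video_ts_increment(90000, Fraction(30000, 1001)))  # 3003
```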
For audio, the increment per packet is generally
(clock rate / 1000) * duration of audio in ms
For example, with G.711 audio at an 8000 Hz clock rate and 30 ms of audio in each packet:
(8000 / 1000) * 30 = 240 samples in 30 ms
This works because G.711 is an 8-bits-per-sample codec at 8000 samples per second, so the RTP clock ticks exactly once per audio sample.
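The audio formula can be sketched the same way. The helper names below are hypothetical, and the starting timestamp of 0 is only for readability (a real sender picks a random initial value). The timestamp field is 32 bits wide, so the running value wraps modulo 2^32:

```python
def audio_ts_increment(clock_rate, duration_ms):
    """RTP timestamp ticks per packet: (clock rate / 1000) * duration in ms."""
    return clock_rate * duration_ms // 1000

def rtp_timestamps(start, increment, count):
    """Timestamps of `count` consecutive packets, wrapping at 32 bits."""
    return [(start + i * increment) % 2**32 for i in range(count)]

inc = audio_ts_increment(8000, 30)   # G.711, 30 ms per packet
print(inc)                           # 240
print(rtp_timestamps(0, inc, 4))     # [0, 240, 480, 720]
```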
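For AAC, it is a bit different. AAC is frame based: each frame carries either 1024 or 960 samples, so the timestamp advances by 1024 or 960 per frame rather than by a value derived from a duration in ms (this assumes the RTP clock rate equals the audio sampling rate, which is the usual setup).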
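To relate the AAC frame size back to time, here is a small sketch. It assumes the RTP clock rate equals the sampling rate, and `aac_frame_duration_ms` is a made-up helper name:

```python
def aac_frame_duration_ms(samples_per_frame, sample_rate):
    """Duration of one AAC frame in ms; the timestamp advances by samples_per_frame."""
    return samples_per_frame * 1000 / sample_rate

print(aac_frame_duration_ms(960, 48000))             # 20.0
print(round(aac_frame_duration_ms(1024, 48000), 2))  # 21.33
print(round(aac_frame_duration_ms(1024, 44100), 2))  # 23.22
```

The increment stays a whole number of samples even though the frame duration in ms usually does not.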