If you’re just trying to remember the pronunciation and meaning, I’d simplify the image. I think it’s best not to offload every small detail into the images, and instead to use mnemonics mainly to fill in the places where natural memory fails.
I’d use a picture of “moat” and imagine someone raising something out of a moat. Every time I see an image of the moat during recall, I’d say “moat…ageru”. That would train my auditory memory to complete “moat” as “motageru”, and the image would tell me what it means. The main purpose of the moat image would be to get me past the tip-of-the-tongue effect when trying to remember the word. If I had trouble recalling the “-ageru” part of the word, then I’d add another image to the scene later.
Other people here might stop by with alternative suggestions. In the meantime, here are some related discussions that might be interesting: