This is largely an issue for you to decide based on your theoretical assumptions about the data and what lies behind them. When you calculate an arithmetic average, you are assuming that the intervals are reasonably similar. (That is, you are implicitly stating that $3-2 = 2-1$ and $3-1 = 2\times (3-2)$.) If you believe that is a reasonable assumption, and others in your field (e.g., reviewers) are likely to agree with you, then it's fine. Using means with ordinal data tends to be more defensible when:
- There are a larger number of ordinal levels (a rule of thumb is $\ge 12$);
- the ordinal levels are composed of many components (e.g., ratings for many related questions are aggregated into a composite); and/or
- the raters were instructed / tried to make the ratings equal interval.
It isn't clear to me that those hold in your case, but it is for you to decide.
You also should think hard about what you mean by '"mainly" rated as 2'. Again, that is for you to decide. However, I would not think of the set of ratings $\{1,1,2,3,3\}$ as "mainly" being $2$, despite the fact that the mean is $2$. I would interpret that as being a somewhat polarizing word, with some thinking it's 'easy' and some thinking it's 'hard'. But again, this is a theoretical issue for you to decide.
For what it's worth (almost certainly very little), if it were me, I would think your ratings were not amenable to be described by means. I think I would interpret '"mainly" rated as 2' as the majority of raters gave this word a 2. That is, I would select words that received $>50\%\ \rm ``2\!"$s.
By contrast, I suspect that you don't only want to select individual words '"mainly" rated as 2', but also want the entire set of selected words to be rated $\approx 2$. To check that aspect, I would feel more comfortable using the mean of all the ratings for all the selected words (or the mean of the words' means). At this point, you are averaging over many more ratings and I think the mean would be more defensible.