7 July

How to find an English teacher. Part 2

PythonProgrammingData visualizationMachine learningNatural Language Processing

This is a continuation of story about using Data Science for finding an English teacher. If you have not read it yet - there is an opportunity to become familiar with it

Briefly  -  we had information about language teachers and tried to apply some basic ideas using pandas and our expectations. Unfortunately we got stuck on the third step, because there is not enough information for resolving our the last requirements  -  we need not more 3 candidates at the end.

It is an approach based on my own experience and can be unsuitable to your point of view, ideas, or principles.

Step 4. Wisdom of crowd

Well, it is a bit harder than expected…

Despite the fact of having some information about teachers  -  we could not blindly believe in it. In the previous step, we made judgments/conclusions about the experience from a text description. And it makes sense when we consider something which can be verified(more or less).

But if we want to choose «TOP 3» from 7 teachers, and we only stick with the description which has been provided - mistakes are bound to happen. The reason can be illustrated  the famous saying «Every cook praises his own broth». And there is a remote chance that people will say negative things about themselves. So that we need to add some related information like reviews. Firstly, we fetch them.

Amount of lessons

The first important moment - how many lessons did students have with a particular teacher? They could write amazing reviews, but will not book classes any more. I presume it is a very correlated moment with the experience of a teacher. Let's have a look at it.


According to an average number of lessons, a person from South Africa has got more loyal students than other.

We need to go deeper…

On the one hand, people like to write a review and language learners as well. On the other hand - these reviews very frequently are simple and look like «A good teacher», «Thank you», «The best lesson», etc.
And… to be honest - the majority of reviews are similar to each other and are not seductive for us. We can try to classify it(sentiment analysis) but it still would be more about emotions than facts. Let's look at this from another perspective.

Instead of reading excited(or disappointed) reviews - we try to understand the structure. Before we worked with text by using regular expressions. Shall we use something more powerful for reaching our goal?

Perhaps… some NLP will harm nobody.

I believe that a good review ought to contain a variety of lexicon and vocabulary. Things like «Present/Past perfect», «Conditionals», «Past participle», «Future tense» are good signs, which point at the ability of students to express their opinion in an eloquent way. 

«Adapted complexity of speech» represents how students of a specific teacher could express their thoughts.

In other words, we try to handle it in an unusual way, ignoring «what exactly do people write?», or "what do they feel about a lesson".  

Instead of it, let's try to understand «HOW their speech was expressed» and to find a teacher who has the most sophisticated reviews. 

And then, rearrange teachers by this principle.

Students of teachers from unknown country(ZZ) and Greece(GR) tend to write more complex reviews than others.

Step 5. Time for big guns.

It is a well-known fact that students want to learn from the best teacher. However… I would rather learn from teachers who taught guys gained the B2 level or upper. And in my opinion, we could assess the level of students from their reviews.

There is an amazing dataset of information -  EF-Cambridge Open Language Database. And the point is that it contains a huge set of combinations «an expression — a language proficiency level»

About levels of language proficiency
According to the Common European Framework of Reference for Languages (CEFR) there some levels of proficiency of learners. Basic(A1,A2), Intermediate(B1,B2), Advanced(C1,C2).

We will use an LSTM - network, which has been trained on this dataset (it is a completely different story...). The main idea behind it - «an automatic classification of English learner proficiency» on text.

Let me show you an example, how it works with a piece from my last essay:


Okay, let's try to do the same in real life.

And we tend to shift our score to students who have more classes with the specific teacher. Usually, people who booked/took many lessons would prefer to write extended reviews than people who had only one lesson. 

And after that look at result this approach:

Ooops… there is a terrific gap between the teacher from South Africa(ZA) and others

Step 6. Visualization+analysis

It is time for the integration! We have the teacher who is dramatically better by «opinion» of the ML-model, but…instead of relying on only this metrics, we could combine different scores, set a rank for teachers, and then range them by this integral value. This idea very similar to «a ranked voting system».

The red square for highlighting teachers who fit more than others to an initial request.

Well… Teachers from an unknown country, South Africa and Spain are the most preferable according to our calculations.

So… we are ready to go back to our expectations and to nail the last one.
A price no more than $20 per hour.
Teachers able to help me in preparation for Cambridge Exams(FCE)
They have real experience
Agree to give me homework and check it
No more than 3 candidates

Did we arrive at our finish point? Definitely!


Particularly for me, it was a little sad, that the teacher from Russia was sixth in this improvised race. At the same time, it is only this specific contest, and in any contest will be leaders and outsiders.

The main thing I wanted to say - I hope these ideas and approaches would help you find a good online teacher using data science and ML.

At the same time, this research (or maybe «an investigation»), not only about looking for a language teacher. It is more about how data hidden behind a user-interface, could help us make a choice based not only on visible metrics but on something beneath it.

P.S. Thank you for reading. There is the final version of Ipython-notebook
Tags:nobody read tags
Hubs: Python Programming Data visualization Machine learning Natural Language Processing
363 1
Leave a comment