Since the launch, many users have been requesting a freely typed technology filter on RemoteBase. I have recently built it.

RemoteBase used to have a fixed number of technology filters for remote companies. But everyone uses all kinds of different technologies. To get a customized list of companies that best fit our skills and tastes, we need more flexible filters.

With this notion in mind, I recently implemented a freely typed technology search on RemoteBase. Let me share how I implemented the feature, and how it works under the hood.

The original version

In the beginning, there was not much data to filter, so the filters were limited in number. The screenshot below shows an early version with only 8 choices for technology filters.

a screenshot of the original version
An original version of RemoteBase

At this time, having a limited number of filters made sense, because adding new filters for all available data could actually diminish the usefulness of the whole filtering functionality.

For instance, if there were only one or two companies in the database using .NET, adding a .NET filter would not have made the filters more useful. Combined with other filters, it could have narrowed the selection so much that no results were returned to the user.

But an increasing number of companies were being listed on RemoteBase, and the data started to mature. Now it made sense to expand the filters.

First iteration

a screenshot of the first iteration
The first iteration

I came up with the above implementation after working for a day or two. The input had autocompletion built in, so the suggestions updated as the user typed the query.

For hours, I tried to reinvent the wheel by creating an autocomplete component from scratch. But it turned out there were too many edge cases to account for. So I ended up forking a React autocompletion component from GitHub and customizing it.

There were a couple of problems I had to tackle:

Normalizing the data

There had not been a relationship between companies and technologies in the database. Instead, all technologies for companies were hard coded as an array of strings. It is not efficient to implement autocompletion with this setup, because there is no static source of truth for all suggestions.

So I wrote a script to loop through all companies and establish a relationship between technologies and companies.

Companies used to look like:

{
  name: "foo",
  technologies: ["C", "Node.js"]
}

And after running the script, they looked like:

{
  name: "foo",
  technologies: [
    {
      _id: "some_id",
      name: "C"
    },
    {
      _id: "some_id_2",
      name: "Node.js"
    }
  ]
}

With a separate collection for technologies:

{
  _id: "some_id",
  name: "C"
},
{
  _id: "some_id_2",
  name: "Node.js"
}
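
As a rough illustration of this normalization step, here is a minimal, database-free sketch. The `normalize` function and the `tech_N` ids are hypothetical stand-ins (a real database would generate its own `_id` values); this is not the script that was actually run.

```python
from itertools import count


def normalize(companies):
    """Replace each company's list of technology names with shared documents.

    Returns (normalized_companies, technologies), where technologies is the
    list that would populate a separate collection.
    """
    ids = count(1)   # stand-in for database-generated _id values
    by_name = {}     # technology name -> technology document

    normalized = []
    for company in companies:
        refs = []
        for name in company["technologies"]:
            if name not in by_name:
                by_name[name] = {"_id": f"tech_{next(ids)}", "name": name}
            refs.append(by_name[name])
        normalized.append({**company, "technologies": refs})

    return normalized, list(by_name.values())


companies = [
    {"name": "foo", "technologies": ["C", "Node.js"]},
    {"name": "bar", "technologies": ["Node.js"]},
]
normalized, technologies = normalize(companies)
```

Both companies end up pointing at the same "Node.js" document, which is what gives the autocompletion a single source of truth.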

Eliminating duplicate data

Since the technologies were denormalized at first, there were duplicates when I normalized them. For instance, when a user typed nod, four different results would come up: node.js, Node.js, NodeJS, and nodejs. But they all refer to the same technology.
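
To make the problem concrete, the sketch below groups spellings that become identical once case and punctuation are ignored. `canonical_key` and `group_variants` are hypothetical helpers written for this illustration, and the key is only meant for flagging candidates to review, not for storing names.

```python
import re
from collections import defaultdict


def canonical_key(name):
    """Lowercase the name and drop everything except letters, digits, '+' and '#'."""
    return re.sub(r"[^a-z0-9+#]", "", name.lower())


def group_variants(names):
    """Group names that share a canonical key; keep only groups with several spellings."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


names = ["node.js", "Node.js", "NodeJS", "nodejs", "Python", "C", "C++"]
duplicates = group_variants(names)
```

Here the four Node.js spellings land in one group, while "C" and "C++" stay apart because "+" is kept in the key.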

So I wrote a script to loop through all companies and their technologies, look for a similar existing entry, and replace each duplicate with it if one was found. While at it, I also applied the same code to collaboration_methods and communication_methods, and migrated those fields too.

from pymongo import MongoClient
import re

conn = MongoClient('localhost')
db = conn.remotebase

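def try_find_or_create(item_name, key):
    # Normalize the spelling, e.g. "ruby on rails" -> "RubyOnRails"
    name = item_name.title().replace(" ", "")

    # Prefer an exact match; otherwise fall back to a case-insensitive
    # substring match so that close spellings map to an existing document
    exact = db[key].find_one({'name': name})
    if exact:
        return exact

    regex = re.compile(re.escape(name), re.IGNORECASE)
    similar = db[key].find_one({'name': {'$regex': regex}})
    if similar:
        return similar

    db[key].insert_one({'name': name})
    return db[key].find_one({'name': name})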

def migrate(company, key):
    for item in company[key]:
        item_name = item if isinstance(item, str) else item['name']

        source = try_find_or_create(item_name, key)

        # pull the original data
        if isinstance(item, str):
            db.companies.update_one({'name': company['name']}, {'$pull': {key: item_name}})
        else:
            db.companies.update_one({'name': company['name']}, {'$pull': {key: {'name': item_name}}})

        # add the new data
        db.companies.update_one({'name': company['name']}, {'$addToSet': {key: source}})

for company in db.companies.find():
    print('migrating', company['name'])
    migrate(company, 'technologies')
    migrate(company, 'communication_methods')
    migrate(company, 'collaboration_methods')

This script did not catch every duplicate, because it relied on a simple regex match to find similar items. So I skimmed through the technologies, manually identified the remaining duplicates, and wrote another script to replace them.

from pymongo import MongoClient

conn = MongoClient('localhost')
db = conn.remotebase

# dictionaries with duplicate as keys and preferred as values
technologies_map = {
    'Nodejs': 'Node.js',
    'ASP.Net': '.Net',
    '.NetFramework': '.Net',
    'Postgressql': 'Postgressql',
    'MysqlAsDb': 'Mysql',
    'Rails5': 'RubyOnRails',
    'Rails+Node': 'RubyOnRails',
    'Meteorjs': 'Meteor',
    'Meteor.Js': 'Meteor',
    'Objective-C': 'ObjectiveC',
    'Python3': 'Python',
    'React.Js': 'React',
    'Reactjs': 'React',
    'Angularjs': 'Angular',
    'Angular.Js': 'Angular',
    'Android+Ios': ['Android', 'Ios'],
    'MonogoDb': 'MongoDB',
    'Mobile': ['Android', 'Ios'],
    'WebAndMobile': ['Html', 'Css', 'Android', 'Ios'],
    'AndroidSdk': 'Android',
    ".Nodejs": "Node.js",
    "with Postgres": "Postgressql",
    "Express.Js": "Expressjs",
    "OurStackIs:Linux": "Linux"
}
communication_methods_map = {
    'Skype.': 'Skype',
    'Ghangouts': 'GoogleHangout',
    'Hangouts': 'GoogleHangout',
    'G+': 'GoogleApps',
    'PhoneCalls': 'PhoneCall',
    "Hipchat": "HipChat",
}
collaboration_methods_map = {
    'Gdrive': 'Googledrive',
    'Gdocs': 'Googleapps',
    'Githubissues': 'Github',
    "Googledrive": "GoogleDrive",
    "DropboxAndGoogleDrive": ["Dropbox", "GoogleDrive"],
    "Dropboxpaper": "Dropbox",
    "TrelloAndGoogleDrive": ["Trello", "GoogleDrive"]
}

migration_map = {
    'technologies': technologies_map,
    'communication_methods': communication_methods_map,
    'collaboration_methods': collaboration_methods_map
}

def add_source_to_key(company, key, source_name):
    # Look up the preferred source, creating it first if it does not exist yet
    preferred_source = db[key].find_one({'name': source_name})
    if not preferred_source:
        db[key].insert_one({'name': source_name})
        preferred_source = db[key].find_one({'name': source_name})

    db.companies.update_one({'name': company['name']}, {'$addToSet': {key: preferred_source}})

def remove_duplicate(company, key):
    key_map = migration_map[key]

    # Iterate on all items and change duplicate references to preferred source
    for item in company[key]:
        for old, new in key_map.items():
            if item['name'] == old:
                db.companies.update_one({'name': company['name']}, {'$pull': {key: {'name': old}}})

                if isinstance(new, list):
                    for source_name in new:
                        add_source_to_key(company, key, source_name)
                else:
                    add_source_to_key(company, key, new)

    # clean up the duplicate sources
    for old in key_map:
        db[key].delete_one({'name': old})

def main():
    companies = db.companies.find()
    for company in companies:
        print('Migrating', company['name'])
        remove_duplicate(company, 'technologies')
        remove_duplicate(company, 'communication_methods')
        remove_duplicate(company, 'collaboration_methods')

if __name__ == '__main__':
    main()
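
The replacement logic can also be tried out without a database. The following is a minimal sketch under that assumption: `apply_duplicate_map` is a hypothetical helper, and `sample_map` is only a small excerpt of the mappings in the script. The resulting order may differ from what the `$pull` and `$addToSet` updates produce in MongoDB.

```python
def apply_duplicate_map(names, duplicate_map):
    """Return names with duplicates replaced by their preferred spelling.

    A list-valued entry expands into several names. Order is preserved and
    repeated names are dropped, similar to what $addToSet does for an array.
    """
    result = []
    for name in names:
        preferred = duplicate_map.get(name, name)
        replacements = preferred if isinstance(preferred, list) else [preferred]
        for replacement in replacements:
            if replacement not in result:
                result.append(replacement)
    return result


# A small excerpt of the mappings from the migration script.
sample_map = {
    "Nodejs": "Node.js",
    "Reactjs": "React",
    "Android+Ios": ["Android", "Ios"],
}

cleaned = apply_duplicate_map(["Nodejs", "Node.js", "Android+Ios", "Python"], sample_map)
```

"Nodejs" and "Node.js" collapse into a single entry, and the combined "Android+Ios" entry splits into two separate technologies.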

Now the technologies, communication_methods, and collaboration_methods were mostly duplicate-free, and the free search was finally useful.

I loved using Python when writing these migration scripts. I found Node.js kind of hard to reason about for this kind of task. I also tried writing them in Ruby, but working with MongoDB was kind of awkward in Ruby syntax.

Async fetching

In this first iteration, the autocomplete feature added some time to the initial page load because 100+ sources were loaded along with the app. When a user typed a query, the autocompletion happened on the client side using the preloaded sources.

It would have been better if there were an API endpoint that responded with suggestions, so that the app did not have to load all the sources in the initial rendering.

Second iteration

After dealing with the challenges above, I shipped the second version of technology search.

a screenshot of the second iteration
The second iteration

This time, I made an API endpoint that responds with matching suggestions based on the user input. All the suggestions are loaded asynchronously as the user types a query. I did not measure how much loading time this saved, but it kind of feels cleaner.

At the moment, the autosuggestion is based on regex matching. Here is the actual code I wrote:

getMatching(req, res) {
  const type = req.params.type;
  const candidate = req.query.candidate;
  const regex = new RegExp(escapeRegExp(candidate), 'i');

  const typeMap = {
    technologies: Technology,
    communication_methods: CommunicationMethod,
    collaboration_methods: CollaborationMethod
  };

  const model = typeMap[type];

  if (!model) {
    return res.status(500).end();
  }

  model.find({ name: regex }, {}, { limit: 10 })
    .then(items => res.json({ items }))
    .catch(err => {
      console.log('Error', err);
      res.status(500).end();
    });
}
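
For readers who want to play with the matching rule on its own, here is a small Python sketch of the same idea: escape the input, match case-insensitively anywhere in the name, and cap the result at 10. The `get_matching` function below works on an in-memory list and is an illustration only, not part of the RemoteBase code.

```python
import re


def get_matching(candidate, names, limit=10):
    """Return up to `limit` names containing `candidate`, ignoring case."""
    # Escaping keeps characters such as '.' and '+' from acting as regex syntax.
    pattern = re.compile(re.escape(candidate), re.IGNORECASE)
    matches = [name for name in names if pattern.search(name)]
    return matches[:limit]


technologies = ["Node.js", "Angular", "Django", "C++", "C", "React"]

node_matches = get_matching("nod", technologies)
cpp_matches = get_matching("c++", technologies)
```

Without the escaping step, a query like "c++" would be an invalid pattern, and "." would match any character instead of a literal dot.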

Also, the autosuggestion input uses react-autosuggest by @moroshko. It was very extensible out of the box. I think this library is an example of the level of generalization and abstraction that open source libraries should live up to.

What’s next?

At the moment, users can only select a single technology to filter companies by. Would it be useful if we could select multiple? I personally think such a feature is overkill for now, but it may be useful in the future.

I could also improve the autosuggestion algorithm. Currently, the technology filter is not truly an autosuggestion, because it uses simple regex matching. For instance, when a user types ‘node.js’, RemoteBase should probably also suggest JavaScript. Maybe I should calculate the ‘relatedness’ of keywords and return the highest matches.
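
One possible way to approximate ‘relatedness’, sketched here purely as an assumption, is to look at how often two technologies are used by the same companies. The hypothetical `related_technologies` function below scores each pair with the Jaccard similarity of their company sets; nothing like this exists in RemoteBase yet, and the sample data is made up.

```python
from collections import defaultdict


def related_technologies(companies, query, top=3):
    """Rank technologies by Jaccard similarity of the company sets that use them."""
    users = defaultdict(set)  # technology name -> set of company names
    for company in companies:
        for tech in company["technologies"]:
            users[tech].add(company["name"])

    base = users.get(query, set())
    scores = []
    for tech, used_by in users.items():
        if tech == query:
            continue
        union = base | used_by
        score = len(base & used_by) / len(union) if union else 0.0
        if score > 0:
            scores.append((tech, score))

    # Highest score first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top]


# Made-up sample data for illustration.
companies = [
    {"name": "a", "technologies": ["Node.js", "JavaScript", "React"]},
    {"name": "b", "technologies": ["Node.js", "JavaScript"]},
    {"name": "c", "technologies": ["Python", "JavaScript"]},
]

related = related_technologies(companies, "Node.js")
```

With this sample data, JavaScript ranks above React for the query "Node.js", and Python is left out because no company uses it together with Node.js.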

All in all, it feels good to ship a fun and much-needed feature. I hope free technology search will make RemoteBase more useful to all developers out there.