Octopart helps you find electronic components, but the way we do that has been somewhat mysterious. You type in a part number and we return pricing and availability information for relevant parts. If you’ve ever wondered how that works, this post kicks off a series on Octopart search technology. I believe that when you understand how a tool works, you can use it more effectively. In this post, we’ll focus on manufacturer part number (MPN) search in particular.
When you search for 18F4550, how do we know that Microchip PIC18F4550-I/PT is a good match?
At first glance this task seems simple: scan all MPNs in the database for exactly the series of characters in the search terms. But because there are over 42 million parts in our database and searches need to complete in a few milliseconds, scanning every part would take way too long. On top of that, there are other complicating factors: differing symbols and letter case shouldn’t remove a part from the results, but a match with the correct symbols might be a little better than one without. Also, part numbers that match at the beginning tend to be better than ones that match in the middle or at the end. The goal is to return the parts that best match your search, not just part numbers that share remote similarities.
To accomplish all that with enough speed, search engines like Octopart do a lot of analysis before a search request is even issued. This involves breaking part numbers, descriptions, and manufacturer names into many different representations.
One of the core Octopart features is part number search. In particular, we decided that being able to search by part of a part number, in addition to a full part number, is important default behavior. To find Microchip PIC18F4550-I/PT, I don’t always want to type the full number, PIC18F4550-I/PT. Instead I’d like to type just the beginning of the number, pic18f4550. Or I might just recall the middle: 18f4550. To accommodate these uses, we built MPN search that, by default, allows searching for any three or more consecutive letters and numbers in an MPN. No wildcards, extra symbols, or special syntax are required.
This is done by using n-grams in analysis when indexing parts and while serving a search request. An n-gram is a sequence of n consecutive characters from a text. When indexing a new part, our search engine splits the MPN into sequences of three consecutive characters. These terms are stored along with the part document.
As an example, Microchip PIC18F4550-I/PT is a part in our database. We start with the full part number in all capital letters:
PIC18F4550-I/PT
Then in lowercase and without symbols:
pic18f4550ipt
Then generate the 3-grams:
pic ic1 c18 18f 8f4 f45 455 550 50i 0ip ipt
When you issue a search request, the same analysis process happens for each search term. Continuing our example, consider searching for
18F4550
We make it lowercase, remove symbols, and generate the 3-grams:
18f 8f4 f45 455 550
Since all of those terms appear in the set of terms generated for PIC18F4550-I/PT, it’s a match! This parallel analysis process, both on the MPNs when parts are indexed and on the query when you do a search, gives us the correct behavior: you can search by any three consecutive characters in an MPN. It’s also fast.
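The index-time and query-time analysis described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not our production code: the function names (normalize, trigrams, build_index, search) are made up for this example, and a plain dictionary stands in for the search engine's index. It also treats a match as "every query 3-gram appears somewhere in the MPN", which ignores the order and position of the 3-grams.

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def trigrams(text: str) -> list[str]:
    """Return every run of three consecutive characters of the normalized text."""
    s = normalize(text)
    return [s[i:i + 3] for i in range(len(s) - 2)]


def build_index(mpns: list[str]) -> dict[str, set[str]]:
    """Hypothetical stand-in for the search index: maps each 3-gram to the MPNs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for mpn in mpns:
        for gram in trigrams(mpn):
            index[gram].add(mpn)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the MPNs that contain every 3-gram of the query."""
    grams = trigrams(query)
    if not grams:  # fewer than three usable characters in the query
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


index = build_index(["PIC18F4550-I/PT", "MSP430F1232IPW"])
print(trigrams("18F4550"))       # ['18f', '8f4', 'f45', '455', '550']
print(search(index, "18F4550"))  # {'PIC18F4550-I/PT'}
```

The speed comes from the lookup structure: a search only touches the entries for the query's own 3-grams and intersects them, rather than scanning every part number.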
However, this is only part of the solution. We’ve learned that people search for the beginning of part numbers more often than the middle or end. This makes sense: MPNs often contain a part family or series at the beginning and less-important packaging information at the end.
To return results that favor the beginning of the MPN, our analysis also involves generating variable-length n-grams from the leading edge. For the MPN MSP430F1232IPW, we generate prefixes of increasing length, such as msp4, msp43, msp430 and so on.
Using these terms we can prioritize parts that match searches at the beginning of part numbers.
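Here is a minimal sketch of that idea in Python. The names edge_ngrams, is_prefix_match and rank are hypothetical, the min_len parameter is an assumption (the shortest prefix we actually index is not specified here), and the "ranking" is reduced to a single rule that puts prefix matches first.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def edge_ngrams(mpn: str, min_len: int = 1) -> list[str]:
    """Return the prefixes of the normalized MPN, from min_len characters up to the full length."""
    s = normalize(mpn)
    return [s[:i] for i in range(min_len, len(s) + 1)]


def is_prefix_match(query: str, mpn: str) -> bool:
    """True if the normalized query is one of the MPN's leading-edge n-grams."""
    return normalize(query) in set(edge_ngrams(mpn))


def rank(query: str, mpns: list[str]) -> list[str]:
    """Put prefix matches ahead of other candidates; the stable sort keeps the original order otherwise."""
    return sorted(mpns, key=lambda mpn: not is_prefix_match(query, mpn))


print(edge_ngrams("MSP430F1232IPW", min_len=3)[:4])
# ['msp', 'msp4', 'msp43', 'msp430']
print(rank("msp430", ["XMSP430A", "MSP430F1232IPW"]))
# ['MSP430F1232IPW', 'XMSP430A']
```

XMSP430A is an invented part number used only to show a candidate that contains the query in the middle rather than at the start.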
We also give special treatment to searches with symbols and searches for complete part numbers. Those are additional signals that determine the ranking of part results. Finally, we calculate a number of other factors, like part availability and the amount of data we have on each part, to determine the popularity and importance of a part. This is used to determine the order of results. All this analysis is done in our cluster running Elasticsearch, an open-source search engine.
Most of the time, part number search on Octopart works out of the box. You type or paste a part number into Octopart and get what you’re looking for. But sometimes, when searching for a very general number, you get too many results. Or you want to filter by a specific series of capacitors. For these cases we have an advanced search syntax that supports wildcards and excluding terms.
When you add a wildcard, either ? or *, to your query, the term with the wildcard can only match on the MPN or SKU field. ? matches any single character; * matches zero or more characters. Since partial part number matching is the default, you can use a wildcard to force a prefix-only search. Searching for lv40 finds parts that contain lv40 anywhere in the part number, description, or other fields. Searching for lv40* finds only parts that start with lv40.
As a more complex example, the query CC0603C?NPO?BN* returns a series of surface-mount capacitors made by Yageo. The first question mark allows any packing style, the second question mark allows any rated voltage, and the star (*) allows any capacitance value. I used the part number information in the datasheet to determine the meaning of the characters in the part number.
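To make the wildcard rules concrete, here is a small Python sketch that translates a wildcard term into a regular expression. It is an illustration rather than a description of how our search engine evaluates wildcards: the function names are hypothetical, case-insensitive matching against the whole MPN is an assumption, and the part numbers below are example strings, not entries checked against our database.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a wildcard term into a regex: ? is one character, * is zero or more."""
    parts = []
    for ch in pattern.lower():
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # treat every other character literally
    return re.compile("".join(parts))


def wildcard_match(pattern: str, mpn: str) -> bool:
    """True if the whole MPN matches the wildcard term (case-insensitive, by assumption)."""
    return wildcard_to_regex(pattern).fullmatch(mpn.lower()) is not None


print(wildcard_match("lv40*", "LV4051"))    # True: starts with lv40
print(wildcard_match("lv40*", "74LV4051"))  # False: lv40 is not at the start
print(wildcard_match("CC0603C?NPO?BN*", "CC0603CRNPO9BN1R0"))  # True
```

Because the pattern has to cover the entire part number, a trailing * is what turns the term into a prefix search.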
So, that’s how MPN search works. This is just a glimpse into how we organize our data and how our search functions, though. In upcoming blog posts, we’ll look at what happens when you search for text (like a manufacturer name or a category description) and when you search for multiple terms. In the meantime, let us know if you have any questions and head to octopart.com to keep searching.