PatSeq Finder

The PatSeq Finder is a sequence similarity search tool based on BLAST that lets you search the Lens patent sequence (PatSeq) databases for matches to a sequence of interest. It is unique in letting you run sequence-based searches across more than 250 million patent sequences, which we serve in either a nucleotide-based or a protein-based database.

You can find the PatSeq Finder here: https://www.lens.org/lens/bio/patseqfinder. For a demo, see PatSeq Finder Search and result pages.

About BLAST

The PatSeq Finder uses the BLAST (Basic Local Alignment Search Tool) algorithms to match a query against the database. Maintained by NCBI, this widely used set of algorithms is flexible and exposes a number of parameters. BLAST works by finding regions of local similarity between sequences: it compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches based on expected value, similarity and coverage scores. BLAST can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families. For more information on BLAST see this Nature Paper, this Wikipedia page, and some recommendations for parameters here.

BLAST version 2.2.31 is currently used in PatSeq Finder. Because BLAST is a heuristic, its results depend on the chosen parameters, but by adapting the statistical thresholds and word size it can reach a close-to-optimal alignment. Shorter sequences are the most affected by parameter selection, so this upgraded version of PatSeq Finder adds an auto-optimisation feature for them.
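To make the role of the word size more concrete, here is a minimal Python sketch of the seeding idea behind BLAST: find short exact words shared by the query and a database sequence, which a real implementation would then extend into alignments. The function name and the example sequences are made up for illustration, and this is a simplification rather than the actual NCBI code.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word matches."""
    # Index every word of length word_size in the subject sequence.
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)

    # Look up every word of the query in that index.
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


query = "ACGTACGGA"          # 9 bases, illustrative only
subject = "TTACGTACGGATT"    # contains the query exactly

print(find_seeds(query, subject, 7))   # [(0, 2), (1, 3), (2, 4)]
print(find_seeds(query, subject, 11))  # [] : the query is shorter than the word
```

The second call shows why parameter choice matters most for short queries: if the word size is larger than the query, no seed can ever be found, even for a perfect match.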

Other, non-heuristic search methods based on dynamic programming (e.g. the Needleman-Wunsch algorithm for global alignments or Smith-Waterman for local alignments) are guaranteed to find an optimal alignment, but they do not perform well on extremely large datasets. Additional algorithms are on our roadmap for more advanced versions of this tool.
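For comparison, the following Python sketch computes a Smith-Waterman local alignment score. The scoring values (match, mismatch, gap) are assumptions chosen for illustration, and the function only returns the best score, not the alignment itself. It fills one table cell per pair of positions, so its cost grows with the product of the two sequence lengths, which is why this approach becomes impractical against hundreds of millions of database sequences.

```python
def smith_waterman_score(a, b, match=1, mismatch=-2, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the scoring table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 4 (the shared "ACGT")
```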

Page Options

Sequence Input

At the top of the page is the sequence input section, where you can enter a sequence in FASTA format. FASTA is a standard format for biological sequences. You can also upload a sequence file and optionally choose to search with only part of the sequence. In this upgraded version you can name and save your searches, but you will need to log in to the Lens before you start your search.
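If you want to check a FASTA file before pasting it in, a small parser along the following lines may help. This is a sketch with hypothetical function names; the 1-based, inclusive range convention in `subsequence` is an assumption and may not match how the web form interprets a partial-sequence selection.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = ""  # bare sequence without a header line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsequence(sequence, start, end):
    """Return residues start..end, counted from 1 and inclusive."""
    return sequence[start - 1:end]


records = parse_fasta(">seq1 example\nACGT\nTTGA\n")
print(records)                             # [('seq1 example', 'ACGTTTGA')]
print(subsequence(records[0][1], 3, 6))    # GTTT
```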

Database Selection

Next, select the database that matches the type of sequence you have entered above. If you entered a DNA or RNA sequence, select the nucleotide PatSeq database; if you entered a protein sequence, select the amino acid PatSeq database. If you enter PatSeq Finder from the sequence tab, the appropriate database is selected automatically.
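If you are unsure which type a sequence is, a rough heuristic like the one below can suggest a database. The letter set and the 0.9 threshold are assumptions made for this sketch; it is not how PatSeq Finder selects a database, and a peptide made up only of A, C, G and T residues would be misclassified.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_database(sequence, threshold=0.9):
    """Suggest 'nucleotide' or 'protein' from the letter composition."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Fraction of letters that are plausible nucleotide codes.
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


print(guess_database("ATGGCCATTGTAATGGGCCGC"))    # nucleotide
print(guess_database("MKTAYIAKQRQISFVKSHFSRQ"))   # protein
```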

BLAST Options

In BLAST options you set the basic type of BLAST search you wish to make. You can select from:

  • blastn – nucleotide query vs nucleotide database
  • blastp – protein query vs protein database
  • blastx – translated nucleotide query vs protein database
  • tblastx – translated nucleotide query vs translated nucleotide database
  • tblastn – protein query vs translated nucleotide database

You can also optimise your search according to how similar you expect the query and the matching sequences to be. When searching for sequences that closely match your query, keep the default “Highly Similar Sequences (>95%) (megablast)”.
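The five program options listed above follow from just two facts: the type of your query and the type of database you search. The sketch below encodes that relationship in Python; the function and argument names are hypothetical and only illustrate the decision.

```python
def choose_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST program name from the query and database types."""
    if query_type == "nucleotide" and database_type == "nucleotide":
        # tblastx translates both sides and compares them as protein.
        return "tblastx" if compare_as_protein else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}/{database_type}")


print(choose_program("protein", "nucleotide"))  # tblastn
```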

Parameters

Standard NCBI BLAST scoring parameters are set as defaults, and you do not need to change them for a general search. If you use very short sequences, we adjust the BLAST parameters automatically so that such searches still work. To understand the recommended parameters for short sequences, please see below.
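As an illustration of what length-based parameter selection can look like, here is a small Python sketch. The cut-off of 30 residues is a hypothetical value, the short-sequence settings are the ones recommended in the FAQ on this page, and the E-value of 10 is the NCBI BLAST+ default rather than a documented PatSeq Finder setting. This is not the actual auto-optimisation logic used by PatSeq Finder.

```python
def suggest_parameters(query_length, short_cutoff=30):
    """Suggest nucleotide search settings based on query length."""
    if query_length < short_cutoff:
        # Short queries: smaller seed, permissive E-value, no filtering.
        return {
            "task": "blastn",
            "word_size": 7,
            "evalue": 100.0,
            "low_complexity_filter": False,
        }
    # Longer queries: megablast defaults.
    return {
        "task": "megablast",
        "word_size": 28,
        "evalue": 10.0,
        "low_complexity_filter": True,
    }


print(suggest_parameters(20)["word_size"])   # 7
print(suggest_parameters(500)["word_size"])  # 28
```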

By default, up to 500 hits are shown. You can raise this to as many as 20,000 hits by changing “Maximum number of hits to Show” in the advanced options menu.

Searching

Once you have checked your search parameters and labeled your search, click the Submit Search button to be taken to the PatSeq Finder Result Page. Please note that a search can take anywhere from 30 seconds to 20 minutes, or longer for very large sequences, and that during high load your query may fail to complete. If you experience a delay, check that you have selected the appropriate database, then come back and try your query later, and send us some feedback. We are currently working on improving our server to accommodate more users and a higher rate of searches.

Frequently Asked Questions

1. Is there a minimum length of sequence under which PatSeq Finder will not perform a BLAST search?

We do not impose a minimum length, and we use the default NCBI BLAST parameters. For short sequences we recommend that you:

  • deactivate the “low complexity filter”
  • use the blastn algorithm (3rd option) – the default megablast algorithm has a word size of 28, which is too “coarse” for short sequences.
  • increase the E-value to 100 or higher. (“The lower the E-value, or the closer it is to zero, the more “significant” the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.”)
  • reduce the size of the initial search seed (word size) – the default for blastn is 11; you could use 7 instead to increase sensitivity for your short sequences.
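The advice on E-values can be made concrete with the standard bit-score form of the expect value, E = m × n × 2^(−S′), where m is the query length, n is the total database length and S′ is the bit score. The sketch below ignores the effective-length corrections BLAST applies, and the database size and bit scores are hypothetical round numbers, assuming very roughly two bits per matched base.

```python
def expect_value(query_length, database_length, bit_score):
    """Approximate E-value: E = m * n * 2^(-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


DATABASE_LENGTH = 1e11  # hypothetical total number of bases

# A perfect 20-base match (assumed bit score of about 40).
short_match = expect_value(20, DATABASE_LENGTH, 40)

# A perfect 200-base match (assumed bit score of about 400).
long_match = expect_value(200, DATABASE_LENGTH, 400)

print(f"{short_match:.2f}")  # 1.82
print(long_match < 1e-100)   # True
```

Under these assumptions even a perfect 20-base match has an E-value above 1, so a strict E-value threshold would discard it, whereas a long perfect match is far below any threshold.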

2. What are low complexity regions? Why should low complexity regions be filtered?

Regions of a sequence that contain only a few kinds of elements (for example, long runs of a single base or a short repeated motif) are called low complexity regions. Including them can produce misleading results, because they match many unrelated sequences. Filtering these regions out allows the program to find the significant and related sequences in the database. For more information on low complexity regions see:

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#LCR
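To give a feel for what such a filter does, here is a simplified Python sketch that masks windows with low Shannon entropy. BLAST itself uses dedicated algorithms for this (DUST for nucleotides, SEG for proteins); the entropy measure, window size and threshold below are stand-in assumptions for illustration, not those algorithms.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter distribution in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


print(mask_low_complexity("AAAAAAAAAAAA"))  # NNNNNNNNNNNN
print(mask_low_complexity("ACGTACGTACGT"))  # ACGTACGTACGT
```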

3. How should I set general parameters in a BLAST search?

For guidelines on BLAST search parameters, see: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml   (see section B: BLAST search parameters)

4. How should I set scoring parameters?

“Reward” and “penalty” together form the scoring system for nucleotide alignments: each matching position earns the reward and each mismatching position incurs the penalty. For more information see: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Reward-penalty

The substitution matrix is an element of the calculation of the bit score. The bit score indicates how good the alignment of the query is to the hit. The BLOSUM62 matrix is the default matrix for BLAST programs, except for blastn and MegaBLAST. The latter two perform nucleotide-nucleotide comparisons and do not use protein-specific matrices. For more information see: http://www.ncbi.nlm.nih.gov/books/NBK21097/#A611
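As a worked example of reward and penalty, the sketch below scores an ungapped nucleotide alignment position by position. The reward and penalty values are illustrative, the function name is hypothetical, and gap costs and the conversion from raw score to bit score are left out.

```python
def ungapped_score(query, hit, reward=1, penalty=-2):
    """Raw score of an ungapped alignment of two equal-length sequences."""
    if len(query) != len(hit):
        raise ValueError("sequences must have the same length")
    # Add the reward for each match and the (negative) penalty for each mismatch.
    return sum(reward if q == h else penalty for q, h in zip(query, hit))


print(ungapped_score("ACGTACGT", "ACGTACGT"))                        # 8
print(ungapped_score("ACGTACGT", "ACGAACGT"))                        # 5
print(ungapped_score("ACGTACGT", "ACGAACGT", reward=2, penalty=-3))  # 11
```

A larger penalty relative to the reward makes the search stricter about mismatches, which suits highly similar sequences; a smaller ratio tolerates more divergence.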

5. How can I save a BLAST search? How can I embed a BLAST search result?

Once you have performed a BLAST search, the URL used to view the results can be saved and bookmarked to view at a later time. Alternatively, you can download the search results in plain text, HTML and XML formats and save them. Using the embed link on the PatSeq Finder Results page, you can easily embed your BLAST search results.

6. How may I combine the search results of two sequences?

We currently do not support combining the results of multiple sequence searches, but it is on the roadmap for development. In the meantime, we recommend using the Excel export feature for bulk processing and then combining the exported files:

  1. Apply filters on the PatSeq Finder results (optional).
  2. Click on “Export results…” on the left-hand side.
  3. Select the fields of interest (“Patent publication number”).
  4. Click on “Export as Excel”, which should download the Excel file.
  5. Repeat steps 1-4 for all remaining queries that you want to combine.
  6. Determine the common publication numbers in the exported files. There are multiple ways to do this, for example in Excel (filter-common-values-from-three-columns-in-excel) or programmatically using command-line tools like sort/join.
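As one programmatic option for finding the common publication numbers, the following Python sketch intersects several exports. It assumes that each exported Excel file has first been saved as CSV and that the column header is “Patent publication number”; both are assumptions about your files, and the publication numbers in the example are placeholders.

```python
import csv
import io


def publication_numbers(csv_text, column="Patent publication number"):
    """Collect the set of publication numbers from one CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def common_publications(*exports):
    """Publication numbers present in every export, sorted."""
    sets = [publication_numbers(text) for text in exports]
    return sorted(set.intersection(*sets)) if sets else []


# Placeholder data standing in for two exported result files.
export_a = "Patent publication number\nPUB-001\nPUB-002\n"
export_b = "Patent publication number\nPUB-002\nPUB-003\n"

print(common_publications(export_a, export_b))  # ['PUB-002']
```

For real files, read each one with `open(path, newline="").read()` and pass the text to `common_publications`.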

If you want to create a Lens patent collection with the common publications, you can use the “import” feature in your Lens work area and paste in the list of “patent publication numbers” from the export file above.

7. PatSeq Finder is not working, help!!

If you face an issue with PatSeq Finder, please send the full search URL that is displaying the error message (e.g. https://www.lens.org/lens/bio/patseqfinder#results/23d58224-54ad-4b56-9e49-7f185aec74e0) to support@cambia.org.
Be aware that we use NCBI BLAST for our sequence searches. This tool requires valid FASTA-formatted input, i.e. a query sequence written in single-letter amino acid or nucleotide codes.
You can find more details at wikipedia/FASTA format or NCBI Blast.
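Before reporting a problem, it can be worth checking that your input really consists of single-letter codes only. The sketch below is a minimal check with a hypothetical function name. It accepts any letter plus “*” and “-” on sequence lines, which is a deliberately loose assumption, and it does not reproduce the validation that BLAST itself performs.

```python
import re

# Letters (any case), stop symbol "*" and gap symbol "-" are accepted.
SEQUENCE_LINE = re.compile(r"^[A-Za-z*\-]+$")


def fasta_problems(text):
    """Return a list of human-readable problems found in FASTA input."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ["input is empty"]
    sequence_lines = [line for line in lines if not line.startswith(">")]
    if not sequence_lines:
        return ["no sequence lines found"]
    problems = []
    for number, line in enumerate(sequence_lines, start=1):
        if not SEQUENCE_LINE.match(line):
            problems.append(f"sequence line {number} contains invalid characters")
    return problems


print(fasta_problems(">q1\nACGTACGT\n"))    # []
print(fasta_problems(">q1\nACGT 1234\n"))   # ['sequence line 1 contains invalid characters']
```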

8. I am trying to find out when sequences from certain patents were added to the database of sequences searched at lens.org using the PatSeq Finder function. Is that information available?

We currently do not capture the date when a sequence listing was added to our database. We provide the date on which a sequence listing was made publicly available in our data sources. You can find this information in the tooltip at the top right of the sequence listing page. We pull sequence data from our feeds on a monthly basis, so for recent patent publications the date that a sequence became available in the Lens won’t be too far off the publication date in the source.

9. Is there a way to find out which patents’ sequences are indexed by PatSeq Finder?

Doing an empty search in PatSeq Text will list the patents that contain sequence listings: https://www.lens.org/lens/search?q=&sat=N%2CP
All these sequences are searchable in PatSeq Finder.