Sunday, June 29, 2014

Why my encrypted value is different each time?





A few days ago, one guy at work asked me when using a crypto wrapper that I have written, why the encrypted value is different each time? For that crypto wrapper, I was simply using the C# RSACryptoServiceProvider [1], with a cert file as public key and a password protected pfx file containing the private key.


Well, it uses RSA [2] algorithm, and I think my colleague must have learned this at school, long time ago ;). The standard text-book RSA algorithm does not use padding so the encrypted value gets generated is same each time. That pretty much makes RSA not secure in real world because hackers can try different combinations to reproduce the same value instead of brute force solving what are the two prime numbers.


With that said, let’s review the basic of RSA in case you get asked some related questions in your interview. I am not a security expert in any means, just the most basic stuff.


1) Find two prime number, prefer large one, more than 100 bits, for example, P and Q, and let
   N = P * Q
M = (P - 1) * (Q - 1)
2) Find a number E that is a relative prime of M, meaning the Greatest Common Divisor (gcd) of E and M is 1.
3) Find a number D, so that E * D mod M = 1.


With the above, now you have a public key E to encrypt your data, and you have your private key D to decrypt. N is also public.


For example, now you want to encrypt data X,
X^E mod N = Y
Your encrypted data is Y.


According to Fermat’s little theorem, you could decrypt Y with private D using
Y^D mod N = X


It is proved that without D, it will take extremely long time to figure out the two prime numbers given N. Until this point, if you are using the above method, the encrypted value should be the same every time using public key E. This is where padding coming into picture.


Take OAEP (Optimal Asymmetric encryption padding) Padding [3], it adds different numbers to the X before encrypting it so that the encrypted values are different each time. In decryption, it removes those values.


An interesting article written (in Chinese) by Dr. Jun Wu, while he was still at Google. If you can read Chinese, highly recommend you to read it.
http://www.kuqin.com/math/20071204/2796.html . His original blog was shut down, this was saved by some other people, you can also take a look at his book “数学之美” [4]

Reference:
[2] RSA:
[3] OAEP
[4] 数学之美

Monday, June 23, 2014

[电面基础]: Java基于Hashtable的问题讨论 哈希表 Continued


很多同学都对于上一期我们post的HashMap的讲解很感兴趣,我们只是cover了最基本的问题,事实上Java 8对于HashMap的性能做了一定程度的提高。

http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html?m=1

这是一篇非常好的Blog简洁的介绍了Java 8对于HashMap的提高,

Key take aways

  1. When collision happens, instead of using a LinkedList, a Tree is used to cut of the search time from O(N) to O(lgN)
  2. Reducing the collision costs will also prevent DOS in a way so that attackers can not send lots of hash-collision requests to slow down the servers. 



Saturday, June 14, 2014

[电面基础]Java基于Map/Hashtable的问题讨论



电面基础:Java基于Map/Hashtable的问题讨论


前言:电话面试(简称电面)以知识点考察为主,在数据结构方面,对于Map/Hashtable的考察绝对是必不可少的一部分。Java作为一门被业界广为应用的语言,在集合类方面有非常全面成熟的解决方案,因此我们用一个专题的形式来讨论在Java中有关Map/Hashtable题和解答。


通常在面试中,面试官会以比较简单的问题开始,比如


“Have you used HashMap before?”  or “What is HashMap? why do we use it?”


简单问题的背后主要考察candidate是否了解和使用过Hash相关的数据结构。那么根据HashMap在java中的实现,以下几点应该在回答中准确的表达:
  • Hash相关的数据结构本质上都是key value pair;
  • Hash中不能存在duplicate key;
  • HashMap提供非常快速查找时间复杂度;
  • 在HashMap具体实现中,Null可以作为key或者value存在;
  • HashMap不是线程安全;


由于Java中和Hash相关的类比较多,我们帮助大家梳理一下,比如


“What’s the difference between Hashtable and HashMap?”


回到最基本的知识,Java中有一个非常重要的Interface叫做Map,其定义: An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value. 像我们平时讨论Hash时经常提到的方法”get(), put(), containsKey(), entrySet()”等其实都是定义在Map这个interface中。而Hashtable和HashMap都实现了Map interface,这是他们最大的共同点。


包子tip: 在我们培训和面试的过程中,很多同学认为Map等同于HashMap,这其实是很不准确的。Map定义了key-value pair操作的基本接口;而HashMap只是Map的一种Hash实现,事实上还有很多其他的数据结构,比如TreeMap,用完全不同的方式实现Map interface。不同的Map实现方式都有相对应的使用场景,之后我们也会聊到。在面试中使用正确的terminology是很重要的细节,如果不注意会非常影响面试官的印象。


具体对于Hashtable和HashMap的不同之处,主要有以下几点:
  • HashMap不提供线程安全,而Hashtable是线程安全;同样,在单线程环境下,HashMap的性能要比Hashtable好一些;(对于有线程安全需求的场景,可以使用Collections.synchronizedMap(HashMap) 来返回一个线程安全的HashMap)
  • HashMap允许Null作为key或者value, Hashtable不允许;
  • HashMap不保证内部元素的存储顺序;(遍历时无序)
  • HashMap中的iterator是fast-fail iterator,Hashtable的不是;(具体讨论请参考下节Iterater/Emumeration)


接下来,问题就开始针对Hash的细节,比如


"How HashMap works in Java?” or "How does get()/put() method of HashMap works in Java?"


HashMap自然是基于Hash实现的key-value pair。具体实现,一个HashMap在内部会维护一个key-value类型(Map.Entry<K, V>) 的数组。在get()/put()中,它使用定义好的hash()算出key object在内部数组中的索引,然后在get()/put()中做相应的存取。如果HashMap中的元素能够相对均匀的分布在数组中的时候,get()/put()是可以达到非常好的常数时间复杂度。


再细致一些,对于非String类型的key,hash()使用key的hashCode()计算出该key-value pair对应在数组的索引(String类型的key有另一套计算hash()的方法,再次不做赘述),在get()方法中会使用key的equal()方法判断数组中的<key, value>中的key是否与传入的key相等。由此可见,在HashMap的使用中,key类型中定义的hashCode()和equal()方法都是非常重要的。假设key内部的值发生变化,导致hashCode()/equal()的结果改变,那么该key在HashMap中的存取则会产生问题。

再接下来,问一下exception case,比如


"What will happen if two different objects have same hashcode?” or "How will you retrieve Value object  if two keys will have same hashcode?”


如上一个问题所说,hash()会直接使用key的hashCode()结果,如果两个key的hashCode()结果相同,那么意味着他们对应的<key, value>被hash到数组中的同一个索引位置,产生所谓的hash collision。处理hash collision,主要有separate chain和open address两种方法。HashMap中使用的是separate chain的方法,即把所有map到同一个数组位置的元素用链表的形式存储,这样的后果就是查询的复杂度会相对上升低(因为需要遍历链表)。值得一提的是,每一个HashMap的Entry中都存有Key, Value, Pointer to next Entry, hashValueOfKey(想一想为什么)。

问一些failure case,比如


“What happens On HashMap in Java if the size of the HashMap  exceeds a given threshold defined by load factor?” or “do you see any problem with resizing of HashMap  in Java?”


Load factor(default to 75%)和Initial capacity(default to 16)是HashMap表的两个重要属性,如果HashMap中entry数量超过了threshold(loadfactor *capacity) ,那么HashMap就不得不扩充capacity(否则hash collison发生的概率就会大大增加,导致整个HashMap性能下降),扩充capacity是一件比较麻烦的事情,因为数组的连续性,HashMap不得不开辟一块更大数组,还要把原来的entries全部transfer到新的数组中,在某些情况下还需要重新计算key的hash()结果。另一方面,HashMap的capacity也不是越大越好,事实上HashMap的遍历本质上是基于内部数组的遍历,如果内部数组是无意义的大,那么遍历HashMap相对来说不是特别高效。