Python Parallel Task Tips

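Original English article: Parallelism in one line

Python's concurrency capabilities are notorious. Setting aside the issues with threads and the GIL, I think the root of the multithreading problem is not technical but conceptual. Most of the material on Python threading and multiprocessing is quite good, but it is overly detailed. It starts strong and trails off: just when it reaches the genuinely practical part, it ends abruptly.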

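Traditional Example

Search DDG (https://duckduckgo.com/) for "Python threading tutorial" and nearly every result gives the same class + queue example.

The standard threading/multiprocessing producer/consumer example:

This is a screenshot of the code, since pasting a large block of code any other way looks ugly. The text version is available here.

Mmm... feels like Java code.

I am not trying to argue that the producer/consumer pattern is the wrong way to handle threading/multiprocessing; it is perfectly fine. In fact, it is the best choice for many problems. However, I do not think it is the approach most useful in day-to-day work.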

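The Problems

First, you need a boilerplate class before you can do anything useful. Next, you need to create a queue for passing objects, with code at both ends of the queue to do the actual work (quite possibly with a second queue to communicate back or to store results).

The more workers, the bigger the problems.

Next, you would probably put those workers into a thread pool to squeeze more speed out of Python. Below is a variation of the fairly good threading example from the IBM tutorial. It covers a common scenario: spreading the retrieval of web pages across multiple threads.

Seriously, Medium. Fix your code support. Code is here.

It works well enough, but look at all that code! Setup methods, thread tracking, and worst of all, if you are as deadlock-prone as I am, the join statements are where things go wrong. And it only gets more complicated from here!

What has been accomplished so far? Essentially nothing. The code above is all basic plumbing, and it is error-prone. (Heck, I forgot to call task_done() on the queue object, and I am too lazy to fix it and take another screenshot.) That is a poor return on effort. Fortunately, there is a better way.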

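Introducing: Map

Map is a cool little function and the key to simplifying concurrent Python code. For those unfamiliar with it, map is borrowed from functional languages such as Lisp. It is a function that maps another function over a sequence, e.g.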

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

This applies urlopen to each item in the sequence and stores all of the results, in order, in a list. It is roughly equivalent to:

results = []
for url in urls: 
    results.append(urllib2.urlopen(url))

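Map handles the in-order iteration over the sequence for us and stores the results of the calls in a simple list.

Why bring it up? Because once the right libraries are imported, map greatly reduces the complexity of concurrency!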

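Two libraries provide a parallel version of map:

multiprocessing, and its little-known but powerful child module, multiprocessing.dummy.

Digression: What is that? Never heard of a threading clone of the multiprocessing library called dummy? Neither had I until recently. It gets exactly one sentence in the multiprocessing documentation, and that sentence does little more than let you know the thing exists. A real marketing failure!

dummy is a complete clone of the multiprocessing module. The only difference is that multiprocessing uses processes while dummy uses threads (with the usual Python limitations). Whatever one has, the other has too. That makes switching between the two very easy, which helps a great deal when you are trying to work out whether a framework call is IO bound or CPU bound.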

Getting Started

To use the parallel versions of map, first import the modules that contain them:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Then instantiate the pool:

pool = ThreadPool()

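This one statement does everything the build_worker_pool function in example2.py did. Specifically, it creates a set of available workers, starts them, and stores them in a variable for easy access.

The pool object takes a few parameters, but for now the only one that matters is the first: processes. It sets the number of workers in the pool. If omitted, it defaults to the number of cores on the machine.

In general, for CPU-bound tasks on the multiprocessing pool, more cores means more speed (with plenty of caveats). For network-bound work on threads, however, results vary widely, so it is best to experiment to find a suitable pool size.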

pool = ThreadPool(4) # Sets the pool size to 4

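If you run too many threads, the frequent switching between them seriously hurts throughput, so it is best to experiment until you find the sweet spot for the task at hand.

Good. Now that the pool object is created and simple parallelism is within reach, let us rewrite the url opener from example2.py!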

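Look at that! It takes only 4 lines of code, 3 of which are routine bookkeeping. The map call easily handles what previously needed 40 lines! For fun, I also timed the run with different pool sizes.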

Results:

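Impressive! It also shows that experimenting with the pool size is worthwhile. On my machine, pool sizes above 9 made little further difference.

Example 2:

Generating thumbnails for thousands of images

Now let us look at a CPU-bound task! One of the problems of this kind I run into most often is processing large folders of images.

One of those tasks is creating thumbnails, which is ripe for being run in parallel.

The basic single-process setup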

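It is a little convoluted for an example, but in essence a folder is passed in, all the images in it are collected, and a thumbnail is created for each and saved to its own directory.

On my machine, processing about 6000 images took about 27.9 seconds.

If we replace the for loop with a parallel map call: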

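Only 5.6 seconds!

That is a huge speedup for changing only a few lines of code. The production version is faster still, because it splits CPU-bound and IO-bound tasks into their own processes and threads, which is usually a recipe for deadlocks. Yet thanks to the explicit nature of map and the absence of manual thread management, it is surprisingly easy to combine the two in a way that is clean, reliable, and easy to debug.

So there it is. Parallelism in (almost) one line.

Reposted from: http://www.oschina.net/translate/python-parallelism-in-one-line?p=2#comments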

——————————————————————————————————————————

Parallelism in one line

Python has a terrible rep when it comes to its parallel processing capabilities. Ignoring the standard arguments about its threads and the GIL (which are mostly valid), the real problem I see with parallelism in Python isn’t a technical one, but a pedagogical one. The common tutorials surrounding Threading and Multiprocessing in Python, while generally excellent, are pretty “heavy.” They start in the intense stuff, and stop before they get to the really good, day-to-day useful parts.

Traditional Example

A quick survey of the top DDG results for “Python threading tutorial” shows that just about every single one of them gives the same Class + Queue based example.

The de facto intro-to-threading/multiprocessing producer/consumer example code:

Image, because Medium still kind of sucks for posting large code sections. Text is available here

Mmm.. Smell those Java roots.

Now, I don’t want to give the impression that I think the Producer / Consumer way of handling threading/multiprocessing is wrong — because it’s definitely not. In fact it is perfect for many kinds of problems. However, what I do think is that it’s not the most useful for day-to-day scripting.

The Problems (as I see them)

For one, you need a boilerplate class in order to do anything useful. Secondly, you’ll need to maintain a Queue through which you can pipe objects, and to top it all off, you’ll need methods on both ends of the pipe in order to do the actual work (likely involving another queue if you want to communicate two ways or store results).
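The screenshot of the class + queue example is not reproduced here, so the following is only a rough sketch of the pattern being described, written for Python 3. The names `Consumer` and `square_all`, and the squaring "work", are made up for illustration and are not the code from the original screenshot.

```python
import queue
import threading


class Consumer(threading.Thread):
    """Boilerplate worker: pulls items off a queue until it sees a sentinel."""

    def __init__(self, tasks, results):
        super().__init__()
        self.tasks = tasks
        self.results = results

    def run(self):
        while True:
            item = self.tasks.get()
            if item is None:            # sentinel: no more work for this worker
                self.tasks.task_done()
                break
            self.results.put(item * item)
            self.tasks.task_done()      # easy to forget; tasks.join() hangs without it


def square_all(numbers, num_workers=2):
    """Producer side: feed the queue, then collect what the workers produced."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [Consumer(tasks, results) for _ in range(num_workers)]
    for worker in workers:
        worker.start()
    for n in numbers:
        tasks.put(n)
    for _ in workers:                   # one sentinel per worker
        tasks.put(None)
    tasks.join()                        # block until every task is marked done
    for worker in workers:
        worker.join()
    return sorted(results.get() for _ in numbers)
```

Even in this stripped-down form, most of the lines are plumbing (a class, two queues, sentinels, two kinds of join) rather than the work itself.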

More workers, more problems.

From here, next thing you’d likely do is make a pool of those worker classes in order to start squeezing some speed out of your Python. Below is a variation of the example code given in the excellent IBM tutorial on threading. It’s a very common scenario in which you spread the task of retrieving web pages across multiple threads.

Seriously, Medium. Fix your code support. Code is Here.

Works like a charm, but look at all that code! Now we’ve got setup methods, lists of threads to keep track of, and worst of all, if you’re anywhere near as deadlock-prone as I am, a bunch of join statements to issue. And it only gets more complex from here!

What’s been accomplished so far? A whole lotta nothin. Just about everything in the above code is pure plumbing. It’s boiler-plate-y, it’s error prone (Hell, I even forgot to call task_done() on the queue object while writing this (I’m too lazy to fix it and take another screenshot)), and it’s a lot of work for little payoff. Luckily, there’s a much better way.


Introducing: Map

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence. e.g.

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

This applies the function urlopen to each item in the passed-in sequence and stores all of the results in a list. It is more or less equivalent to

results = []
for url in urls: 
    results.append(urllib2.urlopen(url))

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.
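To make the equivalence concrete without touching the network, here is a small sketch using a made-up `shout` function in place of urlopen. It assumes Python 3, where map returns a lazy iterator rather than a list, hence the list() call.

```python
def shout(word):
    """Hypothetical stand-in for urlopen: any one-argument function works."""
    return word.upper() + "!"


words = ["map", "is", "cool"]

# Explicit loop: iterate, apply, append.
looped = []
for word in words:
    looped.append(shout(word))

# The same thing with map. In Python 3 the result is lazy, so wrap it in list().
mapped = list(map(shout, words))
```

Both `looped` and `mapped` end up holding the same values in the same order as the input.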

Why does this matter? Because with the right libraries, map makes running things in parallel completely trivial!

Parallel versions of the map function are provided by two libraries: multiprocessing, and also its little known, but equally fantastic step child: multiprocessing.dummy.

Digression: What’s that? Never heard of the threading clone of multiprocessing library called dummy? I hadn’t either until very recently. It has all of ONE sentence devoted to it in the multiprocessing documentation page. And that sentence pretty much boils down to “Oh yeah, and this thing exists.” It’s tragically undersold, I tell you!

Dummy is an exact clone of the multiprocessing module. The only difference is that, whereas multiprocessing works with processes, the dummy module uses threads (which come with all the usual Python limitations). So anything that applies to one, applies to the other. It makes it extremely easy to hop back and forth between the two. Which is especially great for exploratory programming when you’re not quite sure if some framework call is IO or CPU bound.
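As a rough illustration of hopping between the two, the sketch below passes the pool class in as a parameter. `cube` and `run_with` are hypothetical names chosen for this example; only `Pool`, `map`, `close`, and `join` come from the libraries themselves.

```python
from multiprocessing import Pool                        # process-backed pool
from multiprocessing.dummy import Pool as ThreadPool    # thread-backed pool, same interface


def cube(n):
    return n ** 3


def run_with(pool_class, numbers, workers=2):
    """Run the same map call on whichever pool class is handed in."""
    pool = pool_class(workers)
    try:
        return pool.map(cube, numbers)
    finally:
        pool.close()    # no more tasks will be submitted
        pool.join()     # wait for the workers to exit


thread_results = run_with(ThreadPool, [1, 2, 3])
```

Switching to processes is then `run_with(Pool, [1, 2, 3])`. On platforms that start worker processes by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard.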

Getting Started

To access the parallel versions of the map functions the first thing you need to do is import the modules that contain them:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

and instantiate their Pool objects in the code:

pool = ThreadPool()

This single statement handles everything we did in the seven-line build_worker_pool function from example2.py. Namely, it creates a bunch of available workers, starts them up so that they’re ready to do some work, and stores all of them in a variable so that they’re easily accessed.

The pool objects take a few parameters, but for now, the only one worth noting is the first one: processes. This sets the number of workers in the pool. If you leave it blank, it will default to the number of Cores in your machine.

In the general case, if you’re using the multiprocessing pool for CPU bound tasks, more cores equals more speed (I say that with a lot of caveats). However, when threading and dealing with network bound stuff, things seem to vary wildly, so it’s a good idea to experiment with the exact size of the pool.

pool = ThreadPool(4) # Sets the pool size to 4

If you run too many threads, you’ll waste more time switching between them than doing useful work, so it’s always good to play around a little bit until you find the sweet spot for the task at hand.

So, now with the pool objects created, and simple parallelism at our fingertips, let’s rewrite the url opener from example2.py!

Look at that! The code that actually does work is all of 4 lines. 3 of which are simple bookkeeping ones. The map call handles everything our previous 40 line example did with ease! For funzies, I timed both approaches as well as different pool sizes.
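The screenshot of the rewritten opener is missing from this copy. A sketch of what the four lines could look like follows, written for Python 3 (so `urllib.request.urlopen` rather than urllib2). The `fetch_all` wrapper and its `fetch` parameter are assumptions added here so the example can be tried without network access.

```python
from multiprocessing.dummy import Pool as ThreadPool
from urllib.request import urlopen


def fetch_all(urls, fetch=urlopen, pool_size=4):
    """Apply fetch to every URL on a pool of threads; results keep input order."""
    pool = ThreadPool(pool_size)        # bookkeeping: create the workers
    results = pool.map(fetch, urls)     # the one line that does the actual work
    pool.close()                        # bookkeeping: no more tasks coming
    pool.join()                         # bookkeeping: wait for workers to finish
    return results
```

Calling `fetch_all(urls)` with a list of real URLs would open them concurrently. Passing a different callable as `fetch` makes it easy to try the plumbing offline.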

Results:

Pretty awesome! And it also shows why it’s good to play around a bit with the pool size. Any pool size greater than 9 quickly led to diminishing returns on my machine.
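The timing script and its results are not reproduced in this copy. One way such an experiment could be set up is sketched below. `fake_download` simulates network latency with a short sleep and `time_pool_sizes` is a hypothetical helper, so the numbers it produces say nothing about the measurements mentioned above.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(_):
    """Stand-in for a network call: just waits a little."""
    time.sleep(0.02)
    return True


def time_pool_sizes(sizes, num_tasks=16):
    """Return {pool_size: seconds taken to map num_tasks fake downloads}."""
    timings = {}
    for size in sizes:
        pool = ThreadPool(size)
        start = time.perf_counter()
        pool.map(fake_download, range(num_tasks))
        timings[size] = time.perf_counter() - start
        pool.close()
        pool.join()
    return timings


timings = time_pool_sizes([1, 4, 8])
```

Printing `timings` for a range of sizes is usually enough to spot where adding more threads stops helping for a given workload.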

Real World Example 2:

Thumbnailing thousands of images

Let’s now do something CPU bound! A pretty common task for me at work is manipulating massive image folders. One of those transformations is creating thumbnails. It is ripe for being run in parallel.

The basic single process setup

A little hacked together for an example, but in essence, a folder is passed into the program, from that it grabs all of the images in the folder, then finally creates the thumbnails and saves them to their own directory.

On my machine, this took 27.9 seconds to process ~6000 images.

If we replace the for loop with a parallel map call:

5.6 seconds!
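Neither thumbnail script survives in this copy, so the following is only a sketch of the shape of the change: collect the image paths, then hand them to a pool's map instead of looping. The function names, the extension list, and the `make_thumbnail` callable (which would wrap whatever imaging library is actually used) are all assumptions for illustration.

```python
import os
from multiprocessing import Pool

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")    # assumed set of image types


def list_images(folder):
    """Return the sorted paths of all image files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def thumbnail_all(folder, make_thumbnail, pool_class=Pool, workers=4):
    """Replace `for path in images: make_thumbnail(path)` with a parallel map."""
    images = list_images(folder)
    pool = pool_class(workers)
    try:
        return pool.map(make_thumbnail, images)
    finally:
        pool.close()
        pool.join()
```

With the default process pool, `make_thumbnail` has to be a module-level function so that it can be pickled and sent to the worker processes. Passing the thread pool from multiprocessing.dummy as `pool_class` lifts that restriction at the cost of running under the GIL.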

That’s a pretty massive speedup for only changing a few lines of code. The production version of this is even faster by splitting cpu and io tasks into their own respective processes and threads — which is usually a recipe for deadlocked code. However, due to the explicit nature of map, and the lack of manual thread management, it feels remarkably easy to mix and match the two in a way that is clean, reliable, and easy to debug.

So there it is. Parallelism in (almost) one line.

 

When reposting, please credit: jinglingshu's blog » Python Parallel Task Tips

