Python series: managing child processes with the subprocess module

Background photo by Artturi Jalli on Unsplash

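When writing a software project, you often need a parent process to spawn child processes for side tasks. A shell script (.sh), for example, builds on Linux commands and also runs Python and R scripts, chaining their input and output files to produce the final result.

In the same way, with Python as the parent process, the subprocess module can run external commands (child processes), connect pipes for data input and output, and collect return values. This article covers how to manage child processes with subprocess, with code.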

Part1: Importing the subprocess module

subprocess is part of Python's standard library, so it can be imported directly.

import subprocess
from subprocess import PIPE, Popen

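Part2: Basic functions: run() and Popen()

subprocess offers two ways to start a child process: subprocess.run() and subprocess.Popen(). run() covers most use cases, while Popen() is for more advanced, lower-level control. Both share the same basic parameters, so the parameters are introduced below using run():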

subprocess.run()

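There are two ways to pass an external command and its arguments, and the list form is recommended. The shlex module can split a command string into a list for you. Note: shell = True lets the system invoke a shell to run the command, which leaves the door wide open to shell injection, so use it as little as possible.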

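import shlex
import subprocess

# in a Linux shell you would type (operands omitted here):
#   ln -s

# in Python, with subprocess.run()
## method1: pass the command as a list (recommended!)

## split into a list by hand
subprocess.run(['ln', '-s'])

## let shlex split the string into a list
cmd_split = shlex.split('ln -s')
subprocess.run(cmd_split)

## method2: pass the command as a string, with shell = True
subprocess.run('ln -s', shell=True)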
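
To make the shell injection risk concrete, here is a minimal sketch that passes the same untrusted string in both forms. The `user_input` value is a made-up example, and the snippet assumes a POSIX system where `echo` and `/bin/sh` are available.

```python
import shlex
import subprocess

# Hypothetical untrusted input, e.g. a file name typed by a user.
user_input = "notes.txt; echo INJECTED"

# List form: no shell is involved, so the whole string reaches echo
# as one literal argument.
safe = subprocess.run(["echo", user_input], capture_output=True, text=True)

# String form with shell=True: the shell parses ';' as a command
# separator and runs a second command chosen by the input.
unsafe = subprocess.run("echo " + user_input, shell=True,
                        capture_output=True, text=True)

# If a shell really is required, shlex.quote() escapes the input so
# the shell treats it as a single word.
quoted = subprocess.run("echo " + shlex.quote(user_input), shell=True,
                        capture_output=True, text=True)

print(repr(safe.stdout))
print(repr(unsafe.stdout))
print(repr(quoted.stdout))
```

Only the `shell=True` call without quoting executes the second `echo`, which is why the list form is the safer default.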

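By default, run() waits for the command to finish and returns a subprocess.CompletedProcess instance whatever the return code is. To turn a failure into an exception, add check = True, which raises CalledProcessError when returncode != 0.
By default, the child process writes its output straight to the terminal. To capture the output in a variable, use capture_output = True (Python 3.7+).
By default, captured output is bytes; text = True returns it as str instead. The following example compares the two: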

a = subprocess.run(['pwd'], capture_output=True)
print(a.stdout)
# > b'/current/working/directory\n'

a = subprocess.run(['pwd'], capture_output=True, text=True)
print(a.stdout)
# > /current/working/directory
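
The effect of check = True is easiest to see with a command that fails. The sketch below uses the current Python interpreter as a stand-in for a failing external command; the exit status 3 is an arbitrary value picked for illustration.

```python
import subprocess
import sys

# Stand-in for a failing command: a child Python that exits with status 3.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

# Without check: run() returns normally and the caller inspects returncode.
result = subprocess.run(failing_cmd)
print("returncode without check:", result.returncode)

# With check=True: a non-zero status raises CalledProcessError.
caught_code = None
try:
    subprocess.run(failing_cmd, check=True)
except subprocess.CalledProcessError as exc:
    caught_code = exc.returncode
    print("CalledProcessError raised, returncode:", caught_code)
```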

subprocess.Popen()

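Most parameters are the same as for run(). What sets Popen() apart is that it returns immediately without waiting for the child process, so a PIPE on one child's stdout can be wired to another child's stdin while both are running. Pair it with communicate() to avoid deadlock; details are in Part3. Note: run() also accepts PIPE for stdin, stdout, and stderr, but it waits for the command to finish and calls communicate() internally, so the CompletedProcess it returns has no communicate() method.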
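
As a small illustration of the lower-level control Popen() gives, the sketch below starts a child and checks on it while it is still running. The half-second sleep is an arbitrary stand-in for real work.

```python
import subprocess
import sys

# Stand-in for a slow external command: a child Python that sleeps briefly.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

# Popen() has already returned; poll() gives None while the child is running.
still_running = child.poll() is None
print("still running right after start:", still_running)

# wait() blocks until the child exits and returns its exit status.
exit_code = child.wait()
print("exit code:", exit_code)
```

With run(), the call itself would not return until the half second had passed.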

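Part3: Chaining process input and output with PIPE

PIPE plays the same role as the pipe operator | in Linux and removes the need to store intermediate files. The example below uses PIPE to feed the first child process's output to the second child process as input.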

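# equivalent to `pwd | ls -alh` in a shell
cmd1 = subprocess.Popen(['pwd'], stdout=PIPE)
cmd2 = subprocess.Popen(['ls', '-alh'], stdin=cmd1.stdout)
cmd1.stdout.close()  # lets cmd1 receive SIGPIPE if cmd2 exits first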
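
Because `ls` does not read its standard input, the effect of the pipe is easier to observe with a consumer that does. A minimal sketch of the same pattern, assuming the POSIX tools `printf` and `sort` are installed, equivalent to `printf 'b\na\nc\n' | sort`:

```python
from subprocess import PIPE, Popen

# Producer writes three unsorted lines to its stdout pipe.
producer = Popen(["printf", "b\\na\\nc\\n"], stdout=PIPE)

# Consumer reads the producer's pipe as stdin and writes to its own pipe.
consumer = Popen(["sort"], stdin=producer.stdout, stdout=PIPE, text=True)

# Close the parent's copy so the producer gets SIGPIPE if the consumer
# exits early.
producer.stdout.close()

# communicate() reads the consumer's output until it exits.
out, _ = consumer.communicate()
producer.wait()

print(out)
```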

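PIPE is very practical, but an OS pipe has a limited buffer. If the child writes more data than the buffer holds and nothing reads it, the child blocks and never finishes (a deadlock). communicate() keeps reading the PIPE until the child exits, which avoids this: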

e = subprocess.Popen(['pwd'], stdout=PIPE, stdin=PIPE, text=True)
out, err = e.communicate()  # err is None because stderr is not redirected
print(out)
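
To show communicate() handling more data than a pipe buffer holds (typically 64 KiB on Linux), the sketch below sends roughly 1.3 MB to a child and reads the same amount back. The uppercasing child is a made-up stand-in for a real filter program.

```python
import sys
from subprocess import PIPE, Popen

# Stand-in filter: reads all of stdin and writes it back uppercased.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

# About 1.3 MB of text, far more than a typical pipe buffer.
payload = "line of text\n" * 100_000

proc = Popen([sys.executable, "-c", child_code],
             stdin=PIPE, stdout=PIPE, text=True)

# communicate() feeds stdin and drains stdout concurrently, then waits
# for the child, so neither side blocks on a full pipe.
out, err = proc.communicate(input=payload, timeout=30)

print(len(out), "characters received; return code", proc.returncode)
```

Writing the payload with proc.stdin.write() and only afterwards calling proc.stdout.read() risks the deadlock described above, because both pipes can fill up while each side waits for the other.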

Alternatively, send the child's output to a file instead of a PIPE, which also avoids the deadlock.

with open('save_stdout.txt', 'w') as f:
    p = subprocess.Popen(['pwd'], stdout=f, text=True)
    p.wait()  # wait for the child to finish writing before the file is closed
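
A self-contained variant of the file approach, written as a sketch: it runs a stand-in child (the current Python interpreter printing one line), redirects its stdout to a file in a temporary directory, and reads the file back. The file name and message are illustrative only.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "save_stdout.txt")

    # The child writes directly to the file; no pipe buffer is involved.
    with open(path, "w") as f:
        subprocess.run([sys.executable, "-c", "print('hello from child')"],
                       stdout=f, check=True)

    # Read the saved output back in the parent.
    with open(path) as f:
        saved = f.read()

print(repr(saved))
```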

Part4: Worked examples

Example1:

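I want to check whether several log files report an error message. Using PIPE with communicate(), the lines found by the grep child process are stored in the variable out, which is then checked for error messages.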

import shlex
import subprocess
import sys
from subprocess import PIPE, Popen

logs = ['log.1', 'log.2']

for i in logs:
    cmd = shlex.split('grep "terminated with an error exit status" {}'.format(i))
    p = subprocess.Popen(cmd, stdout=PIPE, text=True)
    out, err = p.communicate()

    if out:  # grep printed at least one matching line
        sys.exit('Error occurred, please refer to {} for more detail'.format(i))
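
An alternative sketch for the same check relies on grep's exit status instead of its output: grep exits with 0 when a line matches, 1 when nothing matches, and a higher value on error. The helper name `log_has_error` and the temporary log contents are made up for this illustration.

```python
import os
import subprocess
import tempfile

PATTERN = "terminated with an error exit status"

def log_has_error(path, pattern=PATTERN):
    """Return True if `pattern` occurs in the file at `path`."""
    # -q: print nothing, report the result through the exit status only
    # -F: treat the pattern as a fixed string, not a regular expression
    result = subprocess.run(["grep", "-q", "-F", pattern, path])
    if result.returncode > 1:
        raise RuntimeError("grep failed on {}".format(path))
    return result.returncode == 0

# Demo with two throwaway log files (contents invented for the example).
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "log.1": "job finished normally\n",
        "log.2": "job 7 terminated with an error exit status (1)\n",
    }
    results = {}
    for name, text in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        results[name] = log_has_error(path)

print(results)
```

This also separates "no match" from "grep itself failed" (for example, a missing file), which the output-based check cannot tell apart.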

Example2:

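Here I have a raw VCF file that must first be converted with a bio-tool (run through Singularity) and then filtered in Python, and I don't want to write the converted intermediate file to disk. So I wrote the following Python script: it takes the raw VCF as input, starts a child process to convert it, collects the output through PIPE, and stores it in a Python variable for later filtering.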

import sys
import shlex
import subprocess
from subprocess import PIPE, Popen

origin_vcf = sys.argv[1]

cmds = shlex.split('singularity exec vcflib_1.0.0-rc2.sif vcf2tsv {}'.format(origin_vcf))

input_vcf = subprocess.run(cmds, stdout=PIPE).stdout.decode('utf-8').splitlines()

for lines in input_vcf:  # post-filtering in python
    pass  # filtering logic goes here
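
For large files, a possible variation is to read the child's output line by line through Popen() rather than holding everything in memory. The sketch below is not the script above: a small Python child stands in for the Singularity command, and the column names and the QUAL threshold are hypothetical, chosen only to make the filter concrete.

```python
import sys
from subprocess import PIPE, Popen

# Stand-in for the converter command: prints a small tab-separated table.
# The columns here are invented and do not claim to match vcf2tsv output.
fake_converter = [
    sys.executable, "-c",
    "print('CHROM\\tPOS\\tQUAL'); "
    "print('chr1\\t100\\t50'); "
    "print('chr1\\t200\\t10')",
]

kept = []
with Popen(fake_converter, stdout=PIPE, text=True) as proc:
    # The first line is the header; the remaining lines are records.
    header = next(proc.stdout).rstrip("\n").split("\t")
    for line in proc.stdout:
        row = dict(zip(header, line.rstrip("\n").split("\t")))
        if float(row["QUAL"]) >= 30:  # hypothetical filter threshold
            kept.append(row)

# Leaving the with-block closes the pipe and waits for the child.
print(kept)
```

Since the parent consumes each line as it arrives, the pipe never stays full, so this pattern also sidesteps the deadlock discussed in Part3.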

References

https://docs.python.org/zh-tw/3/library/subprocess.html#
https://www.aikaiyuan.com/4705.html